THE IMPROVEMENT OF 
INTELLIGENCE TESTING 


BY 

HAROLD H. ABELSON, Ph.D.


Teachers College, Columbia University

Contributions to Education, No. 275


Bureau of Publications

Columbia University

NEW YORK CITY

1927



Printed in the United States of America





FOREWORD


Education may in a sense be regarded as the harnessing for
individual and social purposes of the stupendous potentiality
lodged in the brain of each person. The highest realization of
the intellectual capacity of each individual is dependent to a
large extent upon the rapidity and accuracy with which a
knowledge of that capacity may be ascertained. The essential
objective of research in mental measurements has been to increase
the valid discriminatory power of tests without incurring an
uneconomical increase of testing time. The present study offers a
relatively new approach to the problem of test improvement.
It represents a tendency toward the selection and evaluation of
test material in more and more minute and elementary units, on
the one hand, and in terms of relatively more objective criteria,
on the other. It signifies a movement toward the scientific
method that may lead, it is hoped, to outstanding and perhaps
startling improvements in psychological testing.

Those who have followed certain of the recent developments
in the work of Professor William A. McCall will clearly discern
how this study merely carries into a new field many of his
thoughts on item analysis. Only those who have worked with
McCall and have consequently felt the influence of his friendship
and stimulating advice can know the full significance of my
debt to him. The professors and students of the psychological
seminar of Teachers College, Columbia University, have made
many important suggestions. Professors Rudolf Pintner and
Ben Wood have given valued criticisms of the initial plans of
the study. To Professor Henry A. Ruger I am indebted for
his very generous and helpful statistical advice. The criticism of
a section of the work by Professor Edward L. Thorndike proved
very enlightening. Conference with Miss Harriet Barthelmess,
coworker in the field, resulted in the clarification of certain
important issues.

Members of the staff of the College of the City
of New York are responsible for much help in the facilitation
of the investigation. To Professor Paul Klapper, Dean of the
School of Education of the College, I am indebted for many
instances of hearty and friendly cooperation. Professors Samuel
B. Heckman, J. Carleton Bell, and Egbert M. Turner of the
Department of Education of the College rendered valuable
service in connection with the selection and the administration
of the tests employed. The willing assistance of Professor
Morton Gottschall, Registrar of the College, facilitated the
collection of data on the college achievement of the students tested.
The progress of the work was materially accelerated by the use
of the initial test results determined by the writer in connection
with a study conducted under the joint auspices of the School
of Education and the School of Business and Civic Administration
of the College, for which help I wish to acknowledge my
appreciation.

It is very difficult to express adequately my gratefulness to
my wife, Lucie Bernard Abelson, for her ample assistance and
encouragement throughout the course of the study.


H. H. A.



CONTENTS


CHAPTER

I. THE PROBLEM
     Problem and Approach
     Hypothetical Discussion of Problems

II. THE DATA AND THEIR INITIAL TREATMENT
     Selection of Items for Study
     Administration of Items to College Entrants
     Determination of College Success Criterion Scores
     Tabulations and Initial Computations

III. THE IMPROVEMENT OF SCORING THROUGH ITEM ANALYSIS
     Technique for Assigning the New Values to Item Responses
     Results and Their Interpretation
     Empirical Comparison of the Old with the New Scoring Methods
     Tentative Trial of Modifications of the Method of Determining
       New Scoring Values

IV. THE PROBLEM OF THE CHOICE OF THE BEST ITEMS
     Choice of the Item Coefficient
     Characteristics of the Item Coefficient
     Empirical Study of the Reliability of the Item Coefficient
     Practical Effectiveness of the Item Coefficient in Choosing the
       Best Items
     Modification of the Item Coefficient
     Determination of the Objective Factors Associated with Item
       Goodness

V. THE ANALYSIS OF THE SUBTESTS
     Determination of the Measures Employed
     Results

VI. SUMMARY AND CONCLUSIONS

APPENDIX

I. SUGGESTIONS FOR DECREASING THE LABOR ASSOCIATED WITH ITEM
     ANALYSIS

II. MISCELLANEOUS SUPPLEMENTARY RESULTS

III. BIBLIOGRAPHY


THE IMPROVEMENT OF 
INTELLIGENCE TESTING 


CHAPTER I

THE PROBLEM

PROBLEM AND APPROACH

How may intelligence testing be improved? Of the several
possible ways the following are selected for study in the present
investigation:

1. The better scoring of responses to item stimuli.

2. The better choice of items.

In the past, barring a few exceptional studies, items have
been selected and responses evaluated on a highly subjective
basis. Here, more objective techniques are attempted. Moreover,
while in the development of mental measurements entire
examinations and subtests have been studied statistically with
profit, any outstanding improvement of our present instruments
must come as a result of the analysis of the item and of responses
within the item. The keynote of the present study is, then,
objective item analysis. College entrance testing is made focal, but
suggestive applications to all levels and to all kinds of testing
may be made.

HYPOTHETICAL DISCUSSION OF PROBLEMS

The writer has found it both stimulating and clarifying to
ask his research students, when they had selected their problems,
the question, “With infinite but human resources how would you
solve your problems?” The application of this question to the
problems on hand may prove helpful, first, in presenting a
broad outline of the work of the investigation, and second, as
making possible a clearer enunciation of the assumptions and
limitations involved.


With unlimited facilities, then, how might one determine the
best scoring of item responses to obtain the optimum college
success prediction? One answer follows:

1. Tentatively construct or select an infinite number and
variety of test items.

2. Administer these items to a large sampling of college
entrants, equating for each item such factors as fatigue, placement
of the item, mental set, and so on.

3. Determine the success in college of each student after four
years (or possibly his later success in using college training
toward life adjustment).

4. Analyze the possible responses to the stimulus of each item.
(These responses might be classified under several types.)

5. Compute the average degree of college success achieved by
the students, grouped, for each item separately, according to
their response (or type of response).

6. Assign to each response (or type of response) the average
college success score computed for it.

7. Assuming that items are selected with due regard to their
intercorrelations, this technique would theoretically give the
optimum predictive scoring to be used with subsequent similar
groups of students.
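
Steps 5 and 6 of the outline above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the response labels, and the six criterion scores are hypothetical and are not taken from the study.

```python
from collections import defaultdict
from statistics import mean


def response_values(responses, criterion):
    """For one item, give each response type the mean criterion score
    of the students who made that response (steps 5 and 6)."""
    groups = defaultdict(list)
    for response, score in zip(responses, criterion):
        groups[response].append(score)
    return {response: mean(scores) for response, scores in groups.items()}


def score_student(item_responses, value_tables):
    """Score one student by summing the derived values of his responses.
    value_tables[i] is the table returned by response_values for item i."""
    return sum(value_tables[i][r] for i, r in enumerate(item_responses))


# Hypothetical data: one item answered by six students.
responses = ["right", "wrong", "omit", "right", "wrong", "omit"]
criterion = [60, 40, 50, 70, 44, 48]

values = response_values(responses, criterion)
```

With these invented figures the derived values are 65 for "right", 42 for "wrong", and 49 for "omit"; a later student is then scored by looking up his responses in such tables, item by item.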

And similarly the hypothetical solution of the problem of the
best choice of items might be as follows:

1, 2, and 3 as above.

4. Assuming the best scoring of items as indicated above,
determine the correlation between responses to each item and the
criterion of college success, summarizing this relationship in a
coefficient for each item.

5. Determine the intercorrelations of each item with every
other item.

6. By the application of multiple regression formulae, it would
then be possible to select the best group of items for the desired
prediction of college success, provided, of course, that subsequent
entrant groups were similar to the experimental group.
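
The correlational quantities named in steps 4 and 5 above may be illustrated as follows. The sketch assumes that the Pearson product-moment coefficient is the coefficient intended, which the outline does not state; the item names and all scores are hypothetical, and the multiple regression selection of step 6 is not attempted here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_coefficients(item_scores, criterion):
    """Step 4: correlation of each item's scored responses with the criterion."""
    return {name: pearson(scores, criterion)
            for name, scores in item_scores.items()}


def intercorrelations(item_scores):
    """Step 5: correlation of each item with every other item."""
    names = list(item_scores)
    return {(a, b): pearson(item_scores[a], item_scores[b])
            for a in names for b in names if a != b}


# Hypothetical scored responses of five students to two items.
criterion = [3, 5, 4, 8, 6]
item_scores = {
    "item_1": [2, 4, 3, 7, 5],
    "item_2": [1, 0, 1, 0, 1],
}

coefficients = item_coefficients(item_scores, criterion)
between_items = intercorrelations(item_scores)
```

An item with a high coefficient against the criterion and low correlations with the items already chosen is the kind of item the regression approach of step 6 would tend to favor.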

The above analyses present in a rough way, free from specific
considerations and also from practical limitations, possible
answers to the problems studied. Any practical and specific
attempt at solution must carry with it several modifications of the
theoretically sound solutions, thereby increasing the number of
assumptions involved and lessening the probable effectiveness of
a new method of selecting and scoring test items. The following
chapters explain essentially the various methods attempted and
the results obtained as regards the improvement of the tests
studied.



CHAPTER II


THE DATA AND THEIR INITIAL TREATMENT

SELECTION OF ITEMS FOR STUDY

With the cooperation of members of the Department of Education
of the College of the City of New York, five leading college
entrance intelligence examinations were selected to be
administered during the fall of 1925 to the incoming freshman
class. The chosen examinations were:

The Thorndike Intelligence Examination for High School
Graduates

The Roback Mentality Tests for Superior Adults

The Brown University Psychological Examination

The American Council on Education Psychological Examination,
1921 Edition

Thurstone's Psychological Examination IV (1919)

After these had been administered as described in this chapter
and their subtests had been analyzed as explained in Chapter V,
the items of certain subtests were selected for intensive analysis.
The selection was made with the purpose of obtaining a variety
of test types. Tabulated descriptions of the tests studied are
presented on page 16, Chapter III. In all, 205 of the approximately
450 items contained in the above examinations were subjected
to intensive item analysis.

ADMINISTRATION OF ITEMS TO COLLEGE ENTRANTS

The 621 lower freshmen who had entered the day session of
the College of the City of New York during September, 1925,
were employed as subjects. They were divided into four groups,
hereafter termed A, B, C, and D. Groups A, B, and C were
selected alphabetically. The D group consisted of those who
were absent from the regular testing period. This group was
tested some five weeks later. Table I gives a tabulated description
of the groups and of the examinations employed with each.

Groups A, B, and C are held to be about as comparable as
samples selected by chance. Group D includes, perhaps, a small
number of students who may have attempted to avoid the tests.
This may account for the lower average C.S.C. (College Success
Criterion) Score and the greater variability of this group. Since
greater variability tends to raise coefficients of correlation, a
slight downward correction would have to be made in comparing
a coefficient based on this group with those based on the other
groups.


TABLE I

DATA ON THE EXPERIMENTAL GROUPS

Group  No. of    Examinations                     Minutes  Mean of   S.D. of
       Students                                            Scores*   C.S.C.

A      175       Thorndike                          170     50.57     9.99

B      248       1. Brown                            70
                 2. McCall Multi-mental
                    (Experimental Form)              45
                                                    115     49.68     9.90

C      137       1. American Council
                    Test 1 (Completion)              10
                 2. Roback (Tests 5 and 8
                    omitted)                        130
                                                    140     49.53     9.50

D       61       1. Thurstone IV                     30
                 2. American Council Tests           50
                                                     80     45.38    11.64

* See pp. 7 ff. for a full account of the derivation of these scores.


Application to outside groups of the results found with the
subjects employed may be limited by certain special
characteristics of the experimental group.

The group is highly selected along linguistic and academic
lines. Admission to the day session of the College is limited
(barring special examinations rarely passed) to those who have
[...] to ten percentage points lower than the Regents' grades. That
the selection is strenuously exclusive is indicated by the fact that
roughly 40 or 50 per cent of the applicants of the two preceding
semesters had been refused admission to the day session of the
college.

Quantitative evidence as regards the status of the experimental
group and its variability in mental functions is made possible by
means of comparisons with other colleges which participated in
the testing program of the American Council on Education for
the academic year of 1924-25. During this year about 950 freshmen
at the College of the City of New York took the American
Council tests. The ranks of this group, compared with 59 other
colleges, first rank being assigned to the highest average score,
and so on, ranged between first and eleventh. A comparison
between the 1924 group and the 1925 group employed in this
study indicates a still higher selection in the case of the latter.
The mean score on the American Council Completion of the 1925
group of freshmen was 22.18, which is slightly more than the
mean of the scores of the 1924 group. The explanation for the
higher score of the more recent group is very likely the fact that
the entrance requirements were raised and were more stringently
enforced.

Since athletics and social life are minimized in this college as
compared with other colleges, a more studious type of individual
tends to seek entrance. Moreover, an unusually large proportion
must engage in outside work in order to earn their way
through college. Although relatively few are foreign born, a
large proportion are of foreign-born parentage. Practically all
have their homes in New York City, and the larger percentage
of them have lived in an urban community all their lives.

While, with the exception of the D group, the groups were
tested simultaneously and while the usual uniform test conditions
were maintained, it was naturally impossible to eliminate
such factors as item placement, time per item, mental set, fatigue,
practice effect, and the like. Hence certain items might have
been given the advantage over others as predictive agents because
of the favorable operation of some one or some combination
of these factors. Except for the omission of several subtests, as
indicated, the directions of the respective test authors were
followed literally.





DETERMINATION OF COLLEGE SUCCESS CRITERION SCORES

The determination of measures of success at college necessarily
falls far short of what one might do with unlimited resources.
The truest criterion according to the best current educational
theory would be the extent to which college education had
modified the individual so that he might better adjust himself to
the complete environment. However, the only fairly objective
measurements available are the subject grades which the students
earn. In spite of the wide disparity between the fundamental
criterion mentioned above and the college grades, the latter have
been employed, and with the following partial, if not complete,
justification:

1. They are the only feasible measures.

2. They are used at present as the basis of promotion,
acceleration, graduation, honoring of students, etc.

3. They are in line with the present aim of the college,
academic and restricted, perhaps, as it may be.

4. The present study can aim only at the clarification and the
development of techniques, rather than at the determination of
ultimate results.

5. The criterion employed undoubtedly correlates fairly highly
with the truest criterion, and since the comparisons of items
rather than the determination of absolute values are aimed at,
it is quite likely that the results would remain virtually the same
with respect to the truest criterion.

The grades of the first semester alone were employed because:

1. Previous reports * of correlations between tests and college
grades have shown no increase, or only a moderate increase, in
correlations when grades of additional semesters have been added
in the computation of the criterion score.

2. A large proportion leave college after the first semester,
thereby further restricting the group or necessitating many
subjective judgments as regards college success or failure.

3. The utilization of grades of additional semesters would
have entailed a delay in the work which neither was feasible nor
seemed justifiable as indicated above.

* The titles of several such reports are listed in the Bibliography.

In addition to the grades of each student in each subject which
he had attempted, the amount of work attempted in terms of
credits was also included in the computation of the criterion.
Results of a questionnaire study investigating into the number
of hours of outside work, the amount of study for college work,
and the like, were not used because of the high subjectivity and
the apparent inaccuracy of the responses. The problem resolved
itself, therefore, into the following aspects:

1. How to equate marks given in different courses, having
standards varying in severity.

2. How to combine the average equated grade with the number
of credits attempted, or, in other words, how to combine the
quality with the quantity of college achievement.

The marks were equated by means of the so-called T-scale
technique, that is, by reducing the distribution of the marks for
each college subject to a single form, that of the normal
frequency curve. T values for the various letter grades are given
in Table II.
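
A minimal sketch of one common form of T scaling is given here to make the idea concrete. It assumes the usual convention that a grade is placed at the normal deviate corresponding to the proportion of students exceeding it plus half of those receiving it, with a mean of 50 and a standard deviation of 10; the text does not spell out the exact computation used, and the grade counts below are hypothetical.

```python
from statistics import NormalDist


def t_values(grade_counts):
    """Convert a distribution of letter grades into T values.

    grade_counts: list of (grade, number of students), best grade first.
    Each grade is located at the point of the normal curve exceeded by
    the students above it plus half of the students receiving it.
    """
    total = sum(n for _, n in grade_counts)
    above = 0
    table = {}
    for grade, n in grade_counts:
        proportion_exceeding = (above + n / 2) / total
        z = NormalDist().inv_cdf(1 - proportion_exceeding)
        table[grade] = round(50 + 10 * z, 1)
        above += n
    return table


# Hypothetical distribution of marks in one course.
counts = [("A", 10), ("B", 20), ("C", 40), ("D", 20), ("E", 10)]
t_table = t_values(counts)
```

Because each course is scaled from its own distribution, a severely marked course yields higher T values for the same letter grade than a leniently marked one, which is the purpose of equating.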


TABLE II

THE T VALUES FOR THE VARIOUS COLLEGE GRADES GIVEN IN VARIOUS
COURSES

Course                        A     B     C     D     E     F    (F)    N

Art 1                         ?     ?    49.5  43    --     ?     ?     ?
Chemistry 1                   ?     ?    47.5  38    --     ?    25     ?
Chemistry 1a                 70     ?     ?    43.5  --     ?     ?    103
Economics 1, 2               73.5  61.5  50.5  41    33     ?    --    103
English 1                     ?     ?     ?     ?    --     ?    21    339
English 3                     ?    58.5  49.5  40.5  34     ?    --    131
French 1, 2, 3, 4, 53, 51     ?     ?     ?    44     ?    33    25    249
French 61                     ?    53    41    35    --     ?     ?     33
German 1, 3                   ?     ?    45    34    --    --    --     83
Government and History       71     ?     ?    42     ?    30.5  --     ?
Hygiene 1                     ?    62.5  52    41.5   ?    33    20.5  --
Latin 1, 3                    ?     ?    52    46    40    37    33     ?
Latin 61, 53                  ?     ?    50    43.5   ?    33    25     90
Mathematics 1, 2, 3, 4, 7,
  53, 12                      ?     ?    52.5  46.5  41.5   ?     ?    461
Military Science 1           61    54    44.5  39    34    30    --     ?
Physics 1, 2                  ?     ?    53    43.5  35    32.5  --    --
Philosophy 1                 --     ?    55     ?    --    --    --    --
Public Speaking 1            69    59    50    41    34    30.5  22    --
Spanish 1, 3                 67.5   ?    52     ?    --    34    --     84







The symbol (F) signifies the forced or voluntary dropping of
a course by a student; E signifies a condition; F, a failure.

To facilitate computations, these values were reduced to a
scale of from 0 through 11, according to the following
equivalents:

T Score from         20    25    30    35    40    45    50    55    60    65    70    75
          to         24.5  29.5  34.5  39.5  44.5  49.5  54.5  59.5  64.5  69.5  74.5  79.5
Reduced Scale
  Value               0     1     2     3     4     5     6     7     8     9    10    11
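
The reduction above amounts to counting five-point steps upward from a T score of 20. A small sketch follows; the function name is hypothetical, and the range check simply reflects the limits of the printed table.

```python
def reduced_scale_value(t_score):
    """Map a T score between 20 and 79.5 onto the reduced scale 0 through 11.

    Each reduced-scale step covers five T units: 20 to 24.5 gives 0,
    25 to 29.5 gives 1, and so on up to 75 to 79.5, which gives 11.
    """
    if not 20 <= t_score < 80:
        raise ValueError("T score lies outside the range of the table")
    return int((t_score - 20) // 5)


# A few T values read against the table of equivalents.
examples = {t: reduced_scale_value(t) for t in (20, 24.5, 49.5, 50, 79.5)}
```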

The numerical, equated, and reduced grades for each student
were averaged, grades received in less important courses being
given half weight. These averages gave the “quality” rating.
Each course has a given credit value, one credit signifying the
expectancy of two and one-half hours of work weekly on the
part of the student. Simply summing the number of credits
attempted by each student resulted in the “quantity” ratings.
Since students varied in the amount of work undertaken, the
quality ratings alone did not seem adequate in offering a fair
judgment of the students' work. The two ratings were weighted
so as to give the best prediction of score on the Thorndike
Examination, Part I. In order to obtain these weights the
following measures were computed (with a sampling of 100 cases):

COEFFICIENTS OF CORRELATION                            STANDARD DEVIATIONS

Thorndike Part I with Quality Ratings (r12)    .226    (σ1) Thorndike Part I   27.1
Thorndike Part I with Quantity Ratings (r13)   .121    (σ2) Quality Ratings    10.5
Quality Ratings with Quantity Ratings (r23)    .110    (σ3) Quantity Ratings    ?

These values were substituted in the formulae

$$b_{12.3} = \frac{\sigma_1}{\sigma_2}\cdot\frac{r_{12} - r_{13}\,r_{23}}{1 - r_{23}^{2}} \qquad \text{and} \qquad b_{13.2} = \frac{\sigma_1}{\sigma_3}\cdot\frac{r_{13} - r_{12}\,r_{23}}{1 - r_{23}^{2}}$$

of the regression equation

$$x_1 = b_{12.3}\,x_2 + b_{13.2}\,x_3$$

for the prediction of x1 from a knowledge of x2 and x3, b12.3 and
b13.2 being the respective weights of the last named variables.
The x1, x2, and x3 refer to deviations from the respective means
of the Thorndike scores, the quality ratings, and the quantity
ratings.

The values of b12.3 and b13.2 are .57 and 2.3 respectively. The
weights actually assigned were about 2 to the quantity and 7 to
the quality rating. The resulting composite scores were T scaled,
employing the four groups. These T scores were converted into
plus and minus deviations from the general mean. The
appropriate deviation was then entered upon a narrow strip next
to the number representing each student.
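
The weighting of the quality and quantity ratings rests on the standard two-predictor regression weights. The sketch below computes them in that textbook form, which is assumed rather than confirmed to be the form printed in the original; the function names are hypothetical, and the figures in the example are invented for illustration, not the coefficients of the study.

```python
def partial_regression_weights(r12, r13, r23, s1, s2, s3):
    """Weights b12.3 and b13.2 for predicting variable 1 from variables
    2 and 3, all expressed as deviations from their means.

    r12, r13, r23 are the three correlations; s1, s2, s3 are the
    standard deviations of the three variables.
    """
    denominator = 1 - r23 ** 2
    b12_3 = (s1 / s2) * (r12 - r13 * r23) / denominator
    b13_2 = (s1 / s3) * (r13 - r12 * r23) / denominator
    return b12_3, b13_2


def composite(quality, quantity, w_quality, w_quantity):
    """Weighted combination of a student's quality and quantity ratings."""
    return w_quality * quality + w_quantity * quantity


# Illustrative values only: with uncorrelated predictors each weight
# reduces to r * s1 / s_predictor.
b_quality, b_quantity = partial_regression_weights(
    r12=0.5, r13=0.3, r23=0.0, s1=10.0, s2=5.0, s3=2.0
)
```

When the two predictors are themselves correlated, each weight is reduced by the part of its correlation with the criterion that the other predictor already accounts for.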
Twenty-three students had left college so that the usual
ratings were not available for them. Nine had been dropped,
ten had left late in the term, and four had left early or with
no indication on their records as to the time of leaving. Several
persons were asked to place these three types on a percentile
scale in comparison with the remaining students. The percentile
rankings averaged for the respective groups approximately 5th,
20th, and 30th. The T values corresponding to these percentiles
are respectively 36, 43, and 47. The deviations from the mean
of 50 are consequently -14, -7, and -3. These special scores
were treated as similar to the other C.S.C. Scores.

Results found with the use of the C.S.C. Scores are subject
to the partial inconsistency of these scores. An approximate
measure of their reliability was obtained by dividing the grades
for each student into two parts, computing the quality rating
for each part, and determining the coefficient of correlation
between the ratings of the two divisions. The resultant coefficient
with a sampling of 100 students proved to be .388. But the
coefficient thus found represented the consistency between the
halves of the C.S.C. Scores. By means of the Spearman-Brown
formula the probable reliability coefficient of one whole score
(i.e., the score for a whole semester) with a second whole score
was computed. It proved to be .559. A test perfectly designed
to predict the true C.S.C. Score, i.e., the average score of an
infinite number of single scores, would within chance variations
yield a correlation coefficient of only .559 with the single scores
used in this study. This is obvious since an external test can
hardly predict test scores of a trait better than do similar scores
of that same trait. The low validity coefficients found later
are more easily explained in the light of the above fact.
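
The step from the half-score correlation to the whole-score reliability can be checked with the Spearman-Brown formula in its usual form, r_whole = k r / (1 + (k - 1) r), with k = 2 for doubling the length. The function name below is hypothetical; the coefficient .388 is the one reported above.

```python
def spearman_brown(r_part, k=2):
    """Estimated reliability of a measure k times as long as the part
    whose reliability (correlation with a parallel part) is r_part."""
    return k * r_part / (1 + (k - 1) * r_part)


# Correlation between the two halves of the semester's grades, as reported.
r_halves = 0.388
r_whole = round(spearman_brown(r_halves), 3)
```

The result agrees with the whole-score reliability of .559 given in the text.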

TABULATIONS AND INITIAL COMPUTATIONS

Response Analysis. The fourth step in the hypothetical solution
of page 2 calls for the analysis of the significantly different
possible responses to each item stimulus. Several considerations
operated in this analysis. First, it was necessary to obtain
the scores of the student according to the test author's method
of scoring. Hence, responses credited by the author were
retained under his classification. For ease in summing item scores
to obtain test scores, numerical symbols corresponding to the
number of points allowed in the scoring manual were assigned
to these responses. Second, it was advisable to keep the number
of response types low. Therefore, a response likely to occur
rarely, as determined by a cursory inspection, or one only
slightly distinctive, was classed with the type of response most
similar to it. Third, in connection with tests too difficult to be
finished by a considerable number in the time allotted, it was
thought advisable to distinguish between two kinds of omissions,
i.e., omissions in the middle of the test followed by an attempt
later on in the same test, as distinguished from omissions not
followed by a later attempt. The difference involved is that in
the first case the student, after attempting the item, was unable
to make a satisfactory response, while in the second case the
student, whether able or not to answer correctly, had insufficient
time apparently to attempt the item. Fourth, it was held to be
worth investigation to determine whether responses scored
“wrong” by the author were of varying value. Hence, where
possible and feasible, attempts rated “wrong” were classified
under several subgroups. Fifth, especially in the case of
subjectively scored tests, it was thought advisable to provide for
doubtful responses. In general, the fourth and fifth considerations
were subordinated to the second.

Tabulation of Response Symbols for Each Item and for Each
Student. The effective study of item responses is largely
dependent upon the efficiency of the tabulation methods employed.
The tabulation techniques here used are presented with some
detail, though briefly, in order that the principles underlying
the new scoring device might be more concretely understood.
The tabulation form is illustrated in Figure 1. The numbers
assigned to the students are listed vertically at the extreme left.
The item numbers for the subtest studied are listed horizontally
at the top. Thus in the figure 20 students and 40 items are
represented. The symbols 1, a, X, etc., indicate the response of
a student to the various item situations. The scorer, as he
reads the test booklet, writes the appropriate symbol in the
appropriate square. This method saves a considerable amount
of time, making possible the scoring and the tabulation of
responses in only slightly more time than the scoring according to
the usual method ordinarily takes. Referring to the illustration,
any one student's responses to the various items may be noted
by glancing along the row allotted to him, while the responses
made by the students to any one item may be studied by
regarding the column assigned to it.

The completed tabulation makes readily available the numerous
diversified calculations employed in the study.
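
The tabulation sheet described above is in effect a table with one row per student and one column per item. A minimal sketch of such a sheet is given here; the three students, four items, and the particular symbols entered are hypothetical, and the symbol meanings in the comments are assumed for the purposes of the illustration.

```python
from collections import Counter

# Hypothetical tabulation sheet: student number -> response symbols, one
# per item. Assumed meanings: "1" a credited response, "a" a wrong
# attempt, "X" an omission followed by a later attempt, "y" an omission
# with no later attempt.
sheet = {
    1: ["1", "a", "X", "1"],
    2: ["1", "1", "a", "y"],
    3: ["a", "1", "1", "y"],
}


def student_row(sheet, student):
    """All responses of one student (reading along his row)."""
    return sheet[student]


def item_column(sheet, item):
    """All responses to one item (reading down its column); items count from 1."""
    return [row[item - 1] for row in sheet.values()]


def author_score(row):
    """Score by the usual method: one point for each credited response."""
    return sum(1 for symbol in row if symbol == "1")


frequencies_item_4 = Counter(item_column(sheet, 4))
```

Reading a column gives the frequency of each response type for an item, and reading a row gives the ordinary test score, so both the old and the new scoring can be worked from the same sheet.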



FIGURE 1. ILLUSTRATING THE TABULATION OF THE ITEM RESPONSE SYMBOLS



CHAPTER III


THE IMPROVEMENT OF SCORING THROUGH ITEM
ANALYSIS

TECHNIQUE FOR ASSIGNING THE NEW VALUES TO ITEM
RESPONSES

By the “new” scoring method employed in this chapter is
meant the technique for assigning objective item response values
as described below. The word “new” is applied for want of a
convenient name. This chapter treats especially of the technique
which at the outset seemed most promising. Several other
methods are briefly discussed at the close of the chapter.

The hypothetical discussion of the problem offered on page 2,
Chapter I, roughly outlined the steps to be followed in the
determination of the best scoring technique. In Chapter II the
procedure was carried through step 4 of the initial outline.
There it was indicated how the experimental items were selected
and to whom and how they were administered, how the possible
responses to each item were classified, how symbols representing
the various types of response were tabulated for each student
and each item, and how, finally, the College Success Criterion
Scores were computed and tabulated for each student.

Considering a single item at a time, the final steps in the
process are simply (1) to compute the mean of the C.S.C.
Scores of students grouped for each item according to the various
response types of that item (thus the C.S.C. Scores of all
those who made response A of Item 1 are averaged, and so
on), and (2) to assign to each response a value equivalent to
the mean of the C.S.C. Scores associated with it.

The method is further explained in the Appendix for those
who plan to conduct similar work. A simplified illustration is
presented here to show the general underlying principle.

Suppose Test A to be a true-false information test. Three
kinds of responses are possible: a correct response (according to
the author's scoring), an incorrect response, and an omission.
These responses are termed in order R, W, and X. Suppose
ten students took Test A. Assume further that their C.S.C.
Scores (not in terms of T deviations in this example) were, from
lowest to highest, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5. For Item 1 of
Test A the results might have appeared as follows:

STUDENT    RESPONSE    C.S.C. SCORE

   1          R             4
   2          W             2
   3          X             3
   4          R             5
   5          X             2
   6          W             1
   7          R             3
   8          X             4
   9          W             2
  10          W             3

Summary: Response R of Item 1 was made by three students
having C.S.C. Scores of 4, 5, and 3. The mean of the C.S.C.
Scores equals 4.00.

Response W was made by four students having C.S.C. Scores
of 2, 1, 2, and 3. The mean of the C.S.C. Scores equals 2.00.

Response X was made by three students having C.S.C. Scores
of 3, 2, and 4. The mean of the C.S.C. Scores equals 3.00.

Therefore, considering Item 1 by itself, to predict the C.S.C.
Score, theoretically, the best values to assign to Responses R,
W, and X are 4, 2, and 3 respectively.

The scoring values by this method are determined, then, in
terms of the available measure of the function it is desired to
predict. This is one of the significant principles which McCall
employed in the construction of his Multi-mental Scale. In that
test no one of the various possible responses can be said to be
correct absolutely; each is right or wrong to a degree. The
responses to the tests studied here, which, according
to the author's scoring, are for the most part planned to be
entirely right or entirely wrong, are treated in the same way;
in a few cases only are degrees of worth assigned by the author.

RESULTS AND THEIR INTERPRETATION

How do the scoring values determined by the new technique
actually compare with the values assigned by the test authors?

16 The Improvemeni of InfelUgenco Testing 

Results have been computed with the subtests described in Table III.

TABLE III

Subtests Employed in the Comparison of the Old with the New Scoring Values

The American Council Completion items are of the type:

A (An) .. .. (8) is an nninamed (5).

The numbers in parentheses indicate the number of letters contained in the respective omissions. The test calls for speedy responses and offers a variety in difficulty and in content. The new scoring response values are tabulated in Table IV. Each blank to be filled in is regarded as an item. The forty items are numbered vertically at the left. There are four types of responses: type 1, the correct word according to the author's scoring key; type a, the insertion of a wrong word, of part of a word, or of more than one word; type X, an omission followed by an attempt later on in the same test; and type y, an omission with no later attempt. The d1 column gives the amount and direction of the difference between the mean of the C.S.C. Scores of the entire Group B and the average C.S.C. Score of the students who made response 1. In other words, it expresses the deviation, in terms of C.S.C. Scores, of the mean of the response 1 group from the mean of the entire group. The n1 column gives for each item the number of students making response 1. And so da is the deviation of the mean of the a response group from the general mean; na gives the number of cases in the a group; and so on. The deviations are in T units and hence in terms of one tenth of the standard deviation of the C.S.C. Scores of the entire group.
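The d and n columns just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `response_values`, the toy scores, and the use of the population standard deviation are not taken from the study.

```python
from collections import defaultdict
from statistics import mean, pstdev


def response_values(criterion, responses):
    """For one item, return {response: (deviation in T units, frequency)}.

    criterion -- criterion (C.S.C.) score of each student
    responses -- the response type each student gave to this item
    """
    overall_mean = mean(criterion)
    # One T unit is one tenth of the standard deviation of the whole group.
    # Using the population S.D. here is an assumption of this sketch.
    t_unit = pstdev(criterion) / 10.0
    groups = defaultdict(list)
    for score, resp in zip(criterion, responses):
        groups[resp].append(score)
    return {
        resp: ((mean(scores) - overall_mean) / t_unit, len(scores))
        for resp, scores in groups.items()
    }


# Hypothetical data: six students, one item, response types 1, a, X, y.
criterion = [40, 50, 60, 70, 80, 90]
item_responses = ["1", "a", "1", "y", "1", "X"]
values = response_values(criterion, item_responses)
d1, n1 = values["1"]  # deviation and frequency for response 1
```

Each entry of `values` corresponds to one d and n pair in a row of a table such as Table IV.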

TABLE IV

Response Values in Terms of C.S.C. Score, Frequency of Responses, and Mean Value of Response Types of the American Council Completion Test
(Based on Cases 1-137)











TABLE IV (Continued)


llna 

ex 


a* 


ex 

nJ 

4ff 


34 

04 

30 

- .50 

20 

.42 

46 

- LIT 


38 

SIS 

30 

. 61 

21 

. .83 

so 

-107 


36 

103 

6 

2.78 

12 

- 36 

86 

- .23 

33 

37 

— 48 

60 

Ha 

21 

.83 

23 

.50 

33 

3S 

100 

60 

Bza 

12 

- 90 

23 

.15 

42 

80 

— .47 

S3 

-2 00 

IS 

4.53 

S 

60 

61 

40 

3 06 

17 

14 78 

4 



- 

116 

Mean 









(weighled). 

.60 


-1^:5 


“ 


- IJO 



Table IV reveals the probable need for other than the subjective determination of scoring values. The author assigns a credit of 1 for all 1 responses; to the a, the X, and the y responses is given a uniform value of zero. Still omitting the consideration of the grouping of items, how effective for C.S.C. Score prediction is the author's assignment of scoring values? Apparently the highest effectiveness is not achieved. The truer values, within the limits of error for this one group, corresponding to the author's uniform single-credit responses range all the way from minus C9 to plus 4 7S. The uniform zero values range from plus 14.78 to minus 15.47. In two cases a response credited 1 by the author was, according to the more objective measure, inferior to the responses credited zero by the author. In the case of twelve items, types of so-called "wrong" responses were better than what the author terms "correct" responses. The various types of "wrong" responses differ in value as indicated for each item.

The average (weighted) deviation values at the bottom of the table indicate the differences between the types of responses on the whole. For this test it is in general an indication of about the same ability to omit an item because of lack of time (response y) as it is actually to make an incorrect attempt (response a), and both these responses are worse than is an omission presumably made after a fruitless consideration of the item problem. The few who are markedly slow (see column dy), reaching only as far as Item 31 or 32, are decidedly inferior.

While the same general results would probably be found if the new assigned response values were perfectly reliable, it is important to note that the values are conditioned by certain chance variations. A discussion of the unreliability of the new response values occurs on pages 29, 44 ff.

What the best method is of scoring true-false or multiple-choice examinations which permit "guessing" has been and still is a matter of contention among test constructors. The purely mathematical treatment of the question, resulting in the use of general formulae like "$S = R - \frac{W}{n - 1}$" or "Score equals the number correct minus the number wrong over the number of choices less one," is fallacious, because a host of miscellaneous psychological factors enter to destroy the assumption that guessing in the chance sense has uniformly taken place. The best method of scoring this type of examination can be determined only after much intensive empirical study. Moreover, the best that can be hoped for is that such study will reveal significant types of material for which formulae, applying generally to a given type, may be constructed. Table V and Figure 2 show the results of the intensive analysis of the items of a true-false academic information test, the Thorndike 1168.
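To make the chance-correction formula criticized above concrete, the following minimal sketch computes "number correct minus the number wrong over the number of choices less one." The function name and the example figures are illustrative assumptions; the text itself argues that the chance assumption behind the formula does not hold in practice.

```python
def corrected_score(right, wrong, choices):
    """Chance-corrected score: S = R - W / (n - 1).

    right   -- number of correct responses (R)
    wrong   -- number of wrong responses (W); omissions are not counted
    choices -- number of alternatives per item (n); 2 for true-false
    """
    if choices < 2:
        raise ValueError("an item needs at least two alternatives")
    return right - wrong / (choices - 1)


# True-false test (n = 2): the formula reduces to rights minus wrongs.
true_false = corrected_score(right=40, wrong=10, choices=2)

# Four-choice test: each wrong answer costs one third of a point.
four_choice = corrected_score(right=40, wrong=9, choices=4)
```

For a true-false test the formula is simply the usual "rights minus wrongs" rule, which is the author's key examined in Table V.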

The form of the table and the symbols employed are similar to those of Table IV. The item numbers run down the column on the extreme left; the dR column gives the deviation in C.S.C. Score of the "correct" response group mean from the mean of the whole group; the nR column lists the number making the correct response. So with the other symbols: W represents "wrong," and X and y signify types of omissions as in the previous table.

TABLE V

Response Values in Terms of C.S.C. Scores and Frequency of Responses for the Sixty Items of the Thorndike Test 1168
(Based on Cases 1-100)

The basis upon which this true-false test is scored by the author is the usual one, 1 being allowed for a correct response, a similar deduction being made for a wrong response, and a zero credit being given for an omission. The test includes materials from the fields of mathematics, biology, psychology, geography, law, music, history, literature and civics.

Table V and Figure 2 clearly indicate that the test author's scoring does not give to the item responses the values which would be best predictive of the C.S.C. Scores, assuming that the new values approach the true ones. The unweighted mean of the "correct" responses is found to be minus .91T, that of the "wrong" response minus 1.23T, and that of the omission response plus 1.53T. To confess a lack of knowledge proved to be predictive of better college work than did making either a correct or an incorrect response. The logically deduced scoring does not provide for this type of situation; the empirically determined key does, within the limits of chance variations discussed on page 29 below, make such provision, whether for better or worse. In twenty-seven instances the "wrong" response receives a higher value in terms of college achievement than the "correct" response. These reversals may be due partly to the unreliability of the C.S.C. Scores, to the operations of chance, to a low correlation between the item responses and the C.S.C. Scores, or to some unforeseen ambiguity in the item. But the persistently high value of the omission response is a practically certain indication that chance error is not the only significant factor operating. It is safe to say that there is an underlying value to each response in each item which would be roughly approximated for a large group studied by such results as those of Table V.

With this test, as with the American Council Completion, the few who are especially slow (see column marked "dy") are markedly inferior in C.S.C. Score. The last three dy values are of doubtful significance because, the last questions being highly technical, many students may have omitted all three, although they actually had time to consider them.





Distribution of the dX Values. (The author's value is 0.)

FIGURE 2. THE dR, dW, dX VALUES OF TABLE V REPRESENTED GRAPHICALLY

The vertical axis indicates the number of items falling within the half-unit T deviation intervals denoted along the horizontal axis.

The Thorndike 1103 is a three-minute, ten-item picture completion test. Certain miscellaneous points are revealed with the help of the objective analysis of response values, tabulated in the usual manner in Table VI. Responses 2 and 1 are respectively totally correct and partially correct completions. Response m is a completion which, while not creditable according to the author's scoring key, seemed to the scorer to be worth crediting. Responses a, X and y have the same connotation as in Table IV.

The results with Item 1, for example, are noteworthy. The item is very easy, only four students having failed to make the correct response, and yet the four who did not attempt a response were markedly above the average of the group in C.S.C. Score. Evidently, some element in the item situation was present to several superior students which is not apparent and which destroys whatever value the item might have had unless better scoring values are assigned.

TABLE VI

Response Values in Terms of C.S.C. Scores and Frequency of Responses for the Items of the Thorndike Test 1103
(Based on Cases 1-100)



The separate tabulation of doubtful responses offers a means of corrective modification of a tentative scoring key. Thus in Items 5, 6, and 7, the doubtful responses (which are available on the original test blanks) might well be included with the correct ones, while in Item 10 the doubtful response appears to be no better than the incorrect responses.
The values of the omissions not due to lack of time again compare well with the other types of response. The dy column reveals some slight additional data on the elusive problem of test speed as an indication of college success ability.

The Brown Test 1 consists of twenty items of the following type:

The of history has ro unless it helps to the present 

In contrast with the American Council Completion Test, the Brown Completion is characterized by a greater number of omissions to the sentence, the use of more common words, no indication of the number of letters contained in each omitted word, and an emphasis upon the syntactical or grammatical structure of language rather than on diction. The sentence and not each omission is regarded as the item unit.

Table VII has the same general form as Tables IV, V, and VI. The author distinguishes between two qualities of "correct" responses. Credits of 2 and 1 are allotted according to this distinction. The response symbols 2 and 1 retain the significance assigned by the author. Any wrong response (including omissions, since they were few) is included under A. The response values in terms of criterion scores of the Brown Completion items show approximately the same degree of variability for each response type as do those of the other tests discussed above.

TABLE VII

Response Values in Terms of C.S.C. Scores and Frequency of Responses for the Items of the Brown Test 1
(Based on Cases 1-100)





The results of the analysis of the Brown Test 3, a twenty-item opposites test, are indicated in Table VIII. The items are of the form:

Jot grief snd sorry ride discomfitore, happiness scowl enjoy* 

The student is asked to underline the word which is the exact opposite of the first word. In the table response 1 designates the correct response, response A any other response. The

TABLE VIII

Response Values in Terms of C.S.C. Scores and Frequency of Responses for the Items of the Brown Test 3
(Based on Cases 1-100)


Item 


nl 

aj. 

ttJ. 

E 1 

0 00 

100 

oco 

0 

S 

18 

95 

2 so 

5 

3 

20 

73 

S3 

27 

i 

~ JO 

95 

3 00 

S 

5 

OS 

97 

— 2 60 

S 

6 

40 

99 

— 9 60 

1 

7 

33 

87 

—281 

13 

8 

85 

77 

— 366 

83 

s> 

— 07 

75 

40 

85 


165 

36 

— 93 

64 

F 1 

000 

100 

0 00 


2 

000 

100 

0 0(f 


3 

— 10 

90 

90 

10 

< 

— 06 

99 

7 60 

1 

5 

— 13 

97 

507 


6 

— 01 

94 

07 


7 

"3 

93 

— ■>89 


8 

— 01 

63 

0" 

37 

a 

— 80 

40 

53 

60 

10 

— 48 

54 

44 

46 


form of the table is similar to that of Table VI. Because of the small degree of difficulty of most of the items, little diversity of response values is evident. Within a diminished range the general conclusions of the four preceding tables are substantiated.

* Selected from the practice form of the test.

The Roback Tests 1 and 7 were selected for item analysis especially because of the highly subjective nature of their scoring keys. Test 1 calls for writing the general term which expresses the general class under which each of a series of words, such as man, lion, amoeba and ant, is included. Test 7 requires the subject to write in the word that is the exact opposite of each printed word.

TABLE IX

Response Values in Terms of C.S.C. Scores and Frequency of Responses for the Items of the Roback Test 1
(Based on Cases 1-100)

The types of responses for which C.S.C. Score values are presented in Tables IX and X are as follows: correct according to the author; partially correct according to the author; doubtful according to the scorer; incorrect word; omission, with later attempt in the same test; and omission with no later attempt. The symbols are as before: 2, 1, m, a, X, and y, in the order mentioned.

TABLE X

Response Values in Terms of C.S.C. Scores and Frequencies of Responses for the Items of the Roback Test 7
(Based on Cases 1-68)


Item 

rf2 

n2 

dl 

si 

dm 


da 

no 

dl 

»I 

dy 

t*V 

1 

— SO 

47 



141 

19 

—499 

2 





2 

2( 

64 



—3 24 

4 







3 

101 

2 



—3 3; 

IT 

13: 

40 

1 11 

9 



■4 

—107 

19 

—2 43 

18 

3 It 

6 

ns 

22 

6 5: 

3 



S 

— 1 32 

30 



206 

29 

—3 49 

8 

7 51 

1 



Q 

-4 78 

7 



143 

12 

— 79 

44 

8 80 

5 



7 

—171 

9 

134 

6 

193 

7 

94 

35 

—3 6: 

11 



8 

12fi 

48 



— 3 4S 

3 

—3 70 

24 

—4 U 

3 



9 

—3 92 

7 

—6 49 

1 

— 2] 

27 

— o; 

22 

4&“ 

11 



10 

— 05 

57 



—2 49 

3 

416 

6 

—7 49 

8 



11 

—2 39 

21 

2 81 

13 

93 

25 

67 

6 

1 84 

3 



12 

— 149 

4 

— 49 

5 

150 

46 

-^27 

9 

—7 49 

2 



13 

66 

26 



151 

1 

74 

35 

—616 

6 



14 

—312 

8 

153 

41 

—118 

13 

—101 

6 

— 4 32 

6 



15 

14 

54 



—5 09 

5 

1‘’6 

8 

7 51 

1 



16 

3 88 

8 

2 01 

4 

51 

21 

— 94 

36 

18 

9 



17 

—2 64 

13 

3 81 

10 

142 

11 

— 49 

38 

76 

4 



18 

— 86 

59 



—2 49 

3 

19 51 

3 

76 

4 



19 

~ 36 

46 



— 90 

16 

124 

11 

T51 

1 



SO 

— 06 

11 



— 19 

33 

— 04 

80 

— 16 

3 

7 51 

1 

21 

—6 43 

13 



— 94 

20 

189 

24 

3 81 

10 

7 51 

1 

22 

—6 99 

2 



— 22 

56 

134 

6 

3 51 

3 

7 51 

1 

23 

— 10 

54 



— 05 

5 

3 01 

4 

—1 99 

4 

7 51 

1 

24 

— 58 

11 

319 

15 

—149 

IS 

— 1S8 

23 

— 99 

4 

831 

2 

25 

—2 34 

7 

—2 06 

14 

197 

26 

— 12 

16 



- 78 

5 




Several outstanding features of Table IX are:

The number of 1 and m responses is so few as to make a separate treatment of these types infeasible.

The C.S.C. Score values of the 2, or "correct," responses are no higher in general than those of the a, or "incorrect," responses.

The omission responses, X and y, are in general higher in C.S.C. Score values than any other responses.

The one student who completed only slightly more than half the test was markedly superior in college ability, as were the three others who were able apparently to get no further than the twenty-eighth item.

Table X reveals somewhat similar results:

The "doubtful" response values are in general superior to the "correct" response values.

The a response values are on the average about equal to the m response values and slightly below the X response values.

The two students who were unable to reach as far as the next to the last item were decidedly superior in C.S.C. Scores.

The relatively low indication of a definite trend in the C.S.C. Score values of the various response types reflects, in the cases of both tests, the low correlation between the C.S.C. Scores and the scores on these tests. Individual items, however, show evidence of worth as indicated by Tables IX and X and by the treatment in Chapter IV of some of the items.

EMPIRICAL COMPARISON OF THE OLD WITH THE NEW SCORING METHODS

That there exist wide and varied discrepancies between the usual and the new scoring values is clear. Disagreement, however, proves the case for neither method of scoring. Only an empirical comparison of the two methods will yield any definite evaluation. Such a comparison has been attempted, but it will probably prove helpful in the interpretation of the comparative results to consider first the more significant sources of error associated with each of the procedures.

The evaluation of the item responses on the usual basis involves in the main two inaccuracies: that of assigning a single value to responses that actually merit a wide range of values, and that of omitting to take into account unforeseen psychologically effective elements in the item or test situation.

Although planned to minimize these difficulties, the new scoring values are not entirely free from error. First, the values, being dependent upon the criterion scores, are subject to the unreliability and the possible invalidity of these scores. Second, it is at times impossible, at times infeasible, to analyze the possible responses into all of the significant groups. Third, values determined with one sample of a population are subject to errors of sampling or to errors due to changed conditions or varied selection, when applied to a second group.

Certain other sources of error are common to both procedures. It is not now clear which procedure suffers most through them. Thus whenever an item involves guessing, the scoring values tend to become less significant. Moreover, neither the old scoring procedure nor the new technique, as employed to this point, takes into consideration the intercorrelations among the items of a test. The question of unreliability of the new scoring values is again discussed on pages 44 ff. in connection with a consideration of the unreliability of the new item coefficient.

To determine the relative effectiveness of the errors associated with each of the scoring procedures, and consequently the relative worth of these procedures, the following steps were taken:

1. Determining by means of correlation with the C.S.C. Scores the validity of a given subtest, when scored according to the author's method.

2. Rescoring the items of the subtest by the new method.

3. Determining the validity of the subtest when scored by the new method.

4. Comparing the old method validity coefficient with the new.

5. Interpreting the relative size of the coefficients in the light of associated data.

To avoid an unfair comparison, it is necessary to determine validity coefficients for both the new and the old scoring with a group other than that used in the computation of the scoring values. Unless definitely otherwise indicated, these validity coefficients are worked out with a different group from the one used in determining the scoring values. To facilitate the exposition of the results, the term "basal" is applied to the group with which the scoring values were computed, while the word "trial" refers to the second group upon which the comparative validity coefficients were computed.
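The basal and trial procedure described above can be sketched as follows. Everything named here (`fit_scoring_key`, `rescore`, `author_score`, the fallback value for responses unseen in the basal group, and the toy data) is a hypothetical illustration, not the study's own computation.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Product-moment correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def fit_scoring_key(responses, criterion):
    """Per item, map each response to the mean criterion score of the
    basal students who made that response."""
    keys = []
    for item in range(len(responses[0])):
        groups = defaultdict(list)
        for student, score in zip(responses, criterion):
            groups[student[item]].append(score)
        keys.append({resp: mean(scores) for resp, scores in groups.items()})
    return keys


def rescore(responses, keys, fallback):
    """New method: sum the empirical response values. A response never seen
    in the basal group gets `fallback` (an assumption of this sketch)."""
    return [sum(key.get(resp, fallback) for resp, key in zip(student, keys))
            for student in responses]


def author_score(responses):
    """Old method for a completion-type test: one credit per response '1'."""
    return [sum(1 for resp in student if resp == "1") for student in responses]


# Hypothetical basal group: scoring values are computed here.
basal_responses = [["1", "1"], ["1", "a"], ["a", "1"], ["a", "a"]]
basal_criterion = [80, 70, 60, 50]
keys = fit_scoring_key(basal_responses, basal_criterion)

# Hypothetical trial group: both validity coefficients are computed here.
trial_responses = [["1", "1"], ["a", "a"], ["1", "a"]]
trial_criterion = [90, 40, 65]
new_scores = rescore(trial_responses, keys, fallback=mean(basal_criterion))
old_validity = pearson(author_score(trial_responses), trial_criterion)
new_validity = pearson(new_scores, trial_criterion)
```

Keeping the key-fitting step and the validation step on separate groups is the point of the design: a key validated on the students who produced it would be credited with its own sampling error.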

Part of the evidence as to the value of the new scoring method is presented in Table XI. In that table, column I gives the names of the subtests studied; column II, the number of items

TABLE XI

Data on the Empirical Evaluation of the Old and New Scoring Methods


I 

Sabtnt 

II 

bo of 

111 

Old 

Val. 

Otwf 

AD 

IV 

V 

Old 

Scor 

Baul 

Croup 

VI 

bo. of 
Case* 

VII 

Old 

Scor 

VaL 

Cocf 

TrUl 

Tin 

New 

Vat 

Coef 

Trial 

IX 

bo of 
CasM 

Amenean Cooneil 1 
Completion 

40 

J152 

137 

J61 

C3 

.310 

.315 

69 

Thomdtlie UGS . . . 
Aeadenue Infer 
matiea 

60 

.030 

175 

.010 

100 

.206 

— on 

75 

Thoradike 1103 .. 
Putore Complettoa 

10 

042 

175 

061 

200 

.OSl 

.059 

75 

Brown 1 

Completion 

SO 

J130 

SIS 

I .230 

100 

177 

jiej 

100 

Brown 3 . ........ 

Oppostes 

20 

.250 

SIS 

163 

100 

.217 



Boback 1 

Abstnetion 

30 1 

060 

137 

— 193 


j:ss - 

-iisl 

69 


contained in each test; column III, the coefficient of correlation between the author's scores and the C.S.C. Scores, employing all the cases as indicated by the number in column IV. Column V contains the correlation coefficients between the author's scores and the criterion scores with only those cases employed in determining the new scoring values. The number of such cases is given in column VI. Columns VII and VIII give the significant validity coefficients according to the old and the new method of scoring, respectively.

Table XII indicates the difference between the size of the validity coefficients with scores based on the old scoring method on the one hand and with those based on the new technique on the other. The table also indicates the reliability of these differences. Thus the results with two tests point slightly in favor of the new technique, two slightly against it, and two markedly against it. The results with these last two tests alone are sufficiently reliable to show that a true difference quite certainly exists. Thus far, then, with the material employed in this study, the use of the new scoring technique has failed to produce any significant improvements in terms of the size of the validity coefficients, and in two cases it has apparently caused a marked deterioration.
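The text does not state how the reliability of a difference was computed. The sketch below assumes the classical probable-error formulas of the period: P.E. of r equal to .6745(1 - r²)/√N, P.E. of a difference as the root of the summed squares (which ignores any correlation between the two coefficients), and a normal-curve conversion to "chances in 100." All function names are hypothetical.

```python
from math import erf, sqrt

PE_FACTOR = 0.6745  # one probable error = 0.6745 standard errors


def pe_of_r(r, n):
    """Probable error of a correlation coefficient from n cases."""
    return PE_FACTOR * (1 - r * r) / sqrt(n)


def pe_of_difference(pe_a, pe_b):
    """Probable error of a difference, treating the two as independent."""
    return sqrt(pe_a ** 2 + pe_b ** 2)


def chances_in_100(difference, pe_diff):
    """Chances in 100 that the true difference is greater than zero,
    assuming a normal sampling distribution."""
    z = PE_FACTOR * difference / pe_diff  # convert P.E. units to S.E. units
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))


# Illustrative figures: a coefficient of .252 from 137 cases, and a
# difference of .005 whose probable error is .102.
pe_example = pe_of_r(0.252, 137)
chances_example = chances_in_100(0.005, 0.102)
```

A ratio of difference to probable error near zero gives about 50 chances in 100, that is, no evidence either way; a ratio above 3 gives better than 97 chances in 100.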


TABLE XII

Differences between the Old and New Validity Coefficients and the Reliability of the Differences


Snliteit 

niffereare between 
Old and hew \a 
lldllj OoeIBcleDta 

rs of 

nut — 

Cbaneea 
in 100 of 
True 
Plffer 

than Zero 

ttt Faeoc 1 
of o>a 

la Favor ' 
of New 1 


DUX 

Ainericaa GoubuI 1 
Completion 

1 

005 

J02 

049 

51 

Tbotndike IIQS 

Aeadeuto lotormation 

1 217 


107 

8 028 . 

94 

Tliomddie UGS 

PietDTe Completion 


038 

106 

' 356 

59 

Brows 1 

Completion 

031 


003 

333 

59 

Brown 3 

Opposites 

037 


090 

ADO 

53 

Boback 1 

Abstraction 

406 


1 110 

3 601 

99 


If it were possible to locate the cause of the results with each test, it might be possible to state at least under what circumstances the new technique might prove of value.

The results reported in Tables XI and XII, in addition to other data, give some tentative suggestions as regards the solution to this problem. Following are certain outstanding conclusions relative to the attempt to discover the factors associated with the effectiveness of the new scoring method:

1. The number of items contained in a test seems to have no significant association with the effectiveness of the new scoring technique.

2. Excepting for the fact that no test having a high old scoring validity as computed with all the cases* showed a marked deterioration when the new method was applied, there seems to be no clear evidence of the association of this validity measure and the effectiveness of the new method.

3. The number of cases with which the new scoring values were determined did not vary sufficiently to give any indication of the effect of this factor upon the value of the new procedure.

4. The effectiveness of the new scoring technique does seem to depend upon the relative size and direction of the old scoring validity with the basal group as compared with the old scoring validity with the trial group. The two serious reversals indicated in Tables XI and XII seem to be due to the fact that the basal validity coefficients are markedly below the trial validity coefficients.

Substantiating evidence of this relationship was obtained in the instance of the American Council Completion Test by computing new scoring values with the trial or second group, rescoring the test responses of the basal or first group on the basis of these values, and then making the usual comparison of the validity coefficients based on the two methods. The results were:

Old Scoring Validity with Cases on which New Scoring Values Were Determined    .340 ± .072
Old Scoring Validity with a Second Group    .160 ± .079
New Scoring Validity with a Second Group    .252 ± .077

Associated with the use of the new technique is an improvement of .092 in the validity coefficient.

Only in the instance of the Brown Completion Test did the new scoring validity coefficient, as compared with the old scoring validity coefficient with the trial group, move distinctly opposite to the old scoring validity coefficient with the basal group. The dissimilarity between the basal and the trial groups appears to be a significant determinant of the relative success or failure of the new scoring method. This dissimilarity is roughly indicated by the difference between the old scoring validity coefficients when computed with the basal group on the one hand and with the trial group on the other. The following coefficients indicate the effectiveness of the new technique with the American Council Completion Test when the trial group is identical with the basal group (and incidentally when the size of the basal group is increased):

Old Scoring Validity Coefficient (137 cases)    .252 ± .054
New Scoring Validity Coefficient (137 cases)    .511 ± .043

Since approximately the highest possible coefficient (within the limits of chance variation) is .559 (due to the unreliability of the criterion scores), the amount of this difference is remarkable. However, since practically no two groups ever reach even approximate identity, the improvement, practically considered, offers little encouragement.

* See Table XI, column III.

The new scoring technique seems to have failed essentially because of the lowness of the old scoring validity with the basal group and because of the dissimilarity between the basal and the trial group. If the new method is to be of practical utility, it must prove its worth with groups as dissimilar as those here employed, but it ought to receive a trial with tests showing higher validity coefficients than those used here. This implies experimentation in a field where criterion scores that are highly reliable can be found.

TENTATIVE TRIALS OF MODIFICATIONS OF THE METHOD OF DETERMINING NEW SCORING VALUES

Notwithstanding the fact that the new scoring method seems thus far to have proved unsuccessful, largely because of the unreliability of the criterion scores, it might prove serviceable for future work to note, even with the use of the same scores, the effects of modified scoring procedures. As material for this preliminary study the ten best items of the Thorndike True-False Academic Information Test, as chosen by the modified item coefficient described on page 60, were employed. The distribution of the C.S.C. Scores of those making each response for each of the items is presented in Table XIII, together with the frequency of each response and its mean C.S.C. Score value.

In the table the C.S.C. Score value in terms of the deviation from the approximate median is indicated vertically at the left. The frequency of the scores for the entire group of 100 is listed in the next column. The remaining columns give the frequency of the C.S.C. Score for each response group. The total frequencies and the means of the various groups are indicated at the foot of each column. The tabulation is curtailed




represent best the desired value. It is conceivable that in this special type of situation, some other measures like the median or the Q3 might serve better. These two measures are usually less reliable than the mean but eliminate in common the emphasized effect of the extreme cases. The Q3 in addition tends to disregard the distribution of the inferior cases, who may be expected in general to fall very often by chance into the response groups. A third measure permits those having a C.S.C. Score above the median criterion score to judge, as it were, the order and the degree of value of the various responses by assigning to each response a value proportional to the number of the superior group making the given response.
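The alternative response-value measures named above can be compared side by side in a short sketch. The names `upper_quartile` and `response_measures`, the linear-interpolation rule for the quartile, and the sample data are assumptions introduced for illustration only.

```python
from statistics import mean, median


def upper_quartile(scores):
    """Q3 by linear interpolation between order statistics (one of several
    conventions; the choice is an assumption of this sketch)."""
    ordered = sorted(scores)
    position = 0.75 * (len(ordered) - 1)
    low = int(position)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (position - low) * (ordered[high] - ordered[low])


def response_measures(criterion, responses):
    """Return mean, median, Q3 and share of the superior group per response."""
    cut = median(criterion)
    superior_total = sum(1 for score in criterion if score > cut)
    groups = {}
    for score, resp in zip(criterion, responses):
        groups.setdefault(resp, []).append(score)
    measures = {}
    for resp, scores in groups.items():
        superior_here = sum(1 for score in scores if score > cut)
        measures[resp] = {
            "mean": mean(scores),
            "median": median(scores),
            "q3": upper_quartile(scores),
            # Value proportional to the number of above-median students
            # who made this response.
            "superior_share": (superior_here / superior_total
                               if superior_total else 0.0),
        }
    return measures


# Hypothetical item: R = right, W = wrong, O = omitted.
criterion = [40, 50, 60, 70, 80, 90]
responses = ["R", "W", "R", "O", "R", "O"]
measures = response_measures(criterion, responses)
```

With groups as small as those in the tables above, the median and Q3 can rest on one or two cases, which is consistent with the remark that they are usually less reliable than the mean.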

The various measures of response values described above were computed for the ten items. In the case of practically all the items, the difference between the lowest and the highest was reduced to five units. Scores on the ten-item set were determined according to the various proposed rescoring methods, and validity coefficients, correlating these scores with the C.S.C. Scores, were computed. The validity coefficients of correlation were:

Author's Scoring    .210 ± .075
Mean as Response Value    .112 ± .077
Q3 as Response Value    .139 ± .077
Median as Response Value    .173 ± .076
Proportion of Superior as Response Value    .085 ± .077


The same 100 cases were used; the ten items were the same; the only variant was the method of scoring. In the case of the proportional method of assigning scoring values, two items were practically undifferentiating, and hence the score was determined with only eight items. The coefficients are computed with a group other than that employed in determining the various new scoring values.

That reducing the difference between the highest and lowest response value uniformly to five for each item was an improvement over the varied differentiation used heretofore is indicated by the following pairs of validity coefficients of correlation computed with new scores based on the ten items employed above:


Validity Coefficient Employing the Mean with Unlimited Differentiation    .097 ± .077
Validity Coefficient Employing the Mean with Limited Differentiation    .112 ± .077
Validity Coefficient Employing the Q3 with Unlimited Differentiation    .108 ± .077
Validity Coefficient Employing the Q3 with Limited Differentiation    .139 ± .077
Validity Coefficient Employing the Median with Unlimited Differentiation    .139 ± .077
Validity Coefficient Employing the Median with Limited Differentiation    .173 ± .076
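The text does not say how the spread of response values was reduced to five units. One plausible reading, offered purely as an assumption, is a linear rescaling of each item's values so that the lowest becomes 0 and the highest 5; the function name and sample values below are hypothetical.

```python
def limit_differentiation(values, span=5.0):
    """Rescale one item's response values so that max - min equals `span`.

    values -- {response: empirical value} for a single item
    Linear rescaling is an assumed method, not one stated in the study.
    """
    lowest = min(values.values())
    highest = max(values.values())
    if highest == lowest:
        # An undifferentiating item: every response gets the same value.
        return {resp: 0.0 for resp in values}
    scale = span / (highest - lowest)
    return {resp: (value - lowest) * scale for resp, value in values.items()}


# Hypothetical item with right, wrong and omitted responses.
item_values = {"R": 2.0, "W": -2.0, "O": 6.0}
limited = limit_differentiation(item_values)
```

Because every item then spans the same five units, no single item can dominate the total score merely through a few extreme cases in one response group, which is the "approach at uniform weighting" the text refers to.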


The partial evidence here presented points, then, to the use of the median as the best of the response value measures and to the approach at uniform weighting of items by limiting the distance between the lowest and the highest response value to a set number of units. A third tentative suggestion is to employ percentile rank units in the measures of the criterion trait. This would minimize the effect of the extreme cases and yet permit the calculation of the mean. This suggestion is unsupported by any empirical evidence as to its worth. The new methods tried with the ten items have failed to excel the old method, but point to possible added lines of approach to the problem of the improvement of the scoring of tests.



CHAPTER IV 

THE PROBLEM OF THE CHOICE OF THE BEST ITEMS 

CHOICE OF THE ITEM COEFFICIENT

The introductory chapter presents in rough outline the hypothetical solution to the problem of the choice of the best items for a subtest or an examination. The theoretically perfect procedure would involve the determination of the validity coefficients of all available test items, the computation of the intercorrelation of each possible pair of items for the entire examination, and the building of regression formulae (with accompanying regression weights) for the thousands of possible combinations of items. The use of intercorrelations of all pairs of items is obviously outside the pale of practicality, and hence a substitute must be found. A feasible substitute is commonly used by test constructors; namely, to regard all the items within a given subtest score as the element in determining regression weights. Another feasible substitute, involving a modified form of the principles underlying the intercorrelations of items, is developed later on page 49. But in the main, items must be selected on the basis of their validity, that is, on the basis of the effectiveness with which they predict significant criterion scores.

It is therefore essential that test builders be equipped with
a sound method of computing a coefficient to represent the
validity of an item. In the search for a coefficient the first
thought is to attempt to apply the various coefficients of
association or correlation used in usual statistical treatments. Vincent,
after considering various possible measures, decided to use the
method of overlapping.

The most obvious deficiency of that method is that it may be
applied to only a twofold classification of item responses. The
distinct need was felt for a coefficient that would apply to items
yielding three or more categories of responses as well as to those
yielding only two.




McCall has invented a coefficient which eliminates this
deficiency. It is an outgrowth of the principle of assigning scoring
values according to the mean of the criterion scores of the group
making a certain response. The size of the coefficient is
dependent in part upon the distances between each pair of response
means. The second determining factor is the product of the
frequencies of the responses taken in every possible combination
of two. These two factors are combined to give the numerator
of the item validity formula:

$$C = \frac{(M_1 - M_2)(n_1 n_2) + (M_1 - M_3)(n_1 n_3) + (M_2 - M_3)(n_2 n_3) + \text{etc.}}{S.D._{dist} \times N^2}$$

where

$C$ is the coefficient;

$M_1$, $M_2$, $M_3$, etc., are, in their order of size from highest to
lowest, the means in terms of criterion score of the groups
making respectively response 1, 2, 3, etc., for the given
item;

$n_1$, $n_2$, $n_3$, etc., are the frequencies of the respective
responses 1, 2, 3, etc.;

$S.D._{dist}$ is the standard deviation of the entire group of
which the response groups are component parts;

$N^2$ is the square of the frequency of the entire group.

Letting $d_{12}$ represent $M_1 - M_2$, etc., the formula may be more
conveniently written

$$C = \frac{\sum d_{ij}\, n_i n_j}{S.D._{dist} \times N^2}$$
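The formula can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `item_coefficient`, its argument names, and the sample means, frequencies, and standard deviation are all invented for the example and are not data from the study.

```python
from itertools import combinations

def item_coefficient(means, freqs, sd):
    """Sum of (M_i - M_j) * n_i * n_j over all response pairs, over S.D. * N^2."""
    # Order the responses from highest to lowest mean, as the text requires,
    # so that every difference M_i - M_j is non-negative.
    ordered = sorted(zip(means, freqs), key=lambda pair: pair[0], reverse=True)
    total = sum(freqs)  # N, the size of the entire group
    numerator = sum((m_i - m_j) * n_i * n_j
                    for (m_i, n_i), (m_j, n_j) in combinations(ordered, 2))
    return numerator / (sd * total ** 2)

# Hypothetical item with three response types (made-up values).
c = item_coefficient(means=[58.0, 52.0, 47.0], freqs=[20, 30, 10], sd=10.0)
print(round(c, 4))
```

With these invented figures the numerator is 3600 + 2200 + 1500 = 7300 and the denominator is 10 × 60², so the coefficient is a little above .20.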


An empirical study of various possible item coefficients,
including the above, is being made by Miss H. Barthelmess. At
the time it became necessary to select an item coefficient for the
present study, Miss Barthelmess very kindly gave her advice on
the basis of whatever data were then available in her study. The
data pointed to the superiority of the coefficient described above,
and hence it was selected for use.

The original purpose of this phase of the study was to
discover the relation between item validity and certain other
characteristics of items, such as consistency with other items,
difficulty, form, and so on. Before any such relations can be safely
reported to exist or not to exist, it is essential that the
effectiveness of the item coefficient be examined. It is furthermore
necessary to note any spuriously associated elements in the measures
to be related, such as the validity coefficient and the difficulty
measure. Consequently, a consideration of the characteristics
of the coefficient and an empirical evaluation of its effectiveness
are essential.

CHARACTERISTICS OF THE ITEM COEFFICIENT

1. Bliss has shown that if $N^2$ is used in the denominator,
coefficients with groups of varying sizes become comparable. The
algebraic proof is simple.

$$C = \frac{\sum d_{ij}\, n_i n_j}{S.D. \times N^2}$$

Assuming, first, that the deviations between pairs of response
values are constant; second, that the standard deviation of the
entire distribution remains identical with the increased number;
and third, that the response group frequencies remain
proportionally similar; then increasing the size of $N$ by a factor $a$
results in the following:

$$C = \frac{\sum d_{ij}\,(a n_i)(a n_j)}{S.D. \times (aN)^2} = \frac{a^2 \sum d_{ij}\, n_i n_j}{a^2 \times S.D. \times N^2}$$

Dividing both numerator and denominator by $a^2$, the original
formula ensues.

However, the first of the above assumptions may involve a
constant error in that chance errors tend to increase the size of
the coefficient more when N is small than when it is large.
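The comparability across group sizes can be checked numerically. The sketch below assumes the three conditions named in the proof (constant response means, constant standard deviation, proportional frequencies); the function name and all figures are hypothetical.

```python
from itertools import combinations
from math import isclose

def item_coefficient(means, freqs, sd):
    # Pair each response mean with its frequency, highest mean first.
    ordered = sorted(zip(means, freqs), reverse=True)
    total = sum(freqs)
    numerator = sum((m_i - m_j) * n_i * n_j
                    for (m_i, n_i), (m_j, n_j) in combinations(ordered, 2))
    return numerator / (sd * total ** 2)

means, freqs, sd = [58.0, 52.0, 47.0], [20, 30, 10], 10.0  # made-up values
a = 3  # multiplier applied to every response frequency

original = item_coefficient(means, freqs, sd)
enlarged = item_coefficient(means, [a * n for n in freqs], sd)
print(isclose(original, enlarged))
```

The check says nothing about the chance errors mentioned in the text, which are not modeled here.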

2. The division by the measure of variability makes
comparable coefficients computed with groups of varying degrees of
dispersion. Logically, the S.D. appears to be the best measure to
employ because of its reliability and because it weights heavily
the extreme cases, thus equalizing the effect of such cases upon
the deviations of the numerator. However, mathematical proof
on this point is wanting.

3. Other things being equal, the item coefficient formula
weights heavily the equal division into response groups. Thus,
assuming a deviation of one between the values of Response 1
and Response 2, the numerator of the coefficient would vary as
follows for a total group of ten cases, $n_1$ and $n_2$ being the
frequencies of the respective response groups:

    n_1    n_2    numerator

      0     10        0
      1      9        9
      2      8       16
      3      7       21
      4      6       24
      5      5       25
      6      4       24
      7      3       21
      8      2       16
      9      1        9
     10      0        0


The denominator remains the same in each case, and hence the
coefficient varies in proportion with the above numerators. Items
containing responses that are neither too easy nor too difficult
are consequently favored, notably when only two types of
responses are recorded. This characteristic may work perniciously
in the case of items allowing for guessing, like true-false items.
The guessing error tends to equalize the size of the response
groups, and slight erratic deviations between these subgroups are
consequently emphasized. Moreover, in attempting to determine
the relation between item goodness and the difficulty of the item,
as will be shown later, this characteristic of the coefficient tends
to destroy the validity of the results found.
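The numerators for a two-response item can be regenerated directly, since with a deviation of one the numerator reduces to the product of the two frequencies. The variable names below are illustrative only.

```python
total = 10      # cases in the entire group
deviation = 1   # assumed difference between the two response values

# One row per possible split of the ten cases between the two responses.
rows = [(n1, total - n1, deviation * n1 * (total - n1))
        for n1 in range(total + 1)]

for n1, n2, numerator in rows:
    print(f"{n1:3d} {n2:3d} {numerator:4d}")
```

The product is greatest at the even split of five and five, which is the sense in which the formula favors items of middle difficulty.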
4. The size of the item coefficient is dependent upon the
adequateness of the analysis of the significant types of responses.
Thus if two significantly different responses, having consequently
differing true response values, are first treated as distinct and
then are included in the same category, the resultant coefficient
will vary according as the response values were combined or
treated separately. In comparing items, then, a similar degree
of carefulness of response analysis, consistent with feasibility of
scoring, ought to be maintained for each item studied.
Standardization of principles underlying item analysis would be helpful.
Disregarding effects due to chance, the size of the coefficient is
not increased merely by increasing the number of response
categories unless a more valid distinction among responses actually
is made. Thus if a response type has a CSC Score value of
58 and a frequency of 20, for example, to separate the responses
into two types with frequencies of, say, 15 and 5, or 10 and 10,
types having no true distinction, will yield for each the same
value, namely 58, provided the effects of chance variations are
omitted. The chance variations associated with the formation
of types not truly significant tend to increase the size of the
item coefficient.
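The claim that a distinction without a true difference leaves the coefficient unchanged can be illustrated as follows. Only the value 58, the frequency 20, and the split into 15 and 5 come from the text; the second response (mean 50, frequency 30), the standard deviation of 8, and the function name are assumptions made for the example.

```python
from itertools import combinations
from math import isclose

def item_coefficient(means, freqs, sd):
    # Pair each response mean with its frequency, highest mean first.
    ordered = sorted(zip(means, freqs), reverse=True)
    total = sum(freqs)
    numerator = sum((m_i - m_j) * n_i * n_j
                    for (m_i, n_i), (m_j, n_j) in combinations(ordered, 2))
    return numerator / (sd * total ** 2)

# One response type valued 58 with frequency 20, against a second type.
combined = item_coefficient([58.0, 50.0], [20, 30], sd=8.0)

# The same 20 cases separated into two types of 15 and 5, both still valued 58.
separated = item_coefficient([58.0, 58.0, 50.0], [15, 5, 30], sd=8.0)

print(isclose(combined, separated))
```

The pair formed by the two artificial types contributes nothing because their means are equal; in real data chance would make them differ slightly, which is the inflation the text warns of.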

5. In the item coefficient formula the differences between the
response values play an important part. Should these
differences be expressed in whole numbers or in finer units? What is
the effect on the size of the coefficient of the degree of fineness
used in determining differences? Theoretically, the coefficient
ought in general to increase in size with the refinement of units
because of the greater differentiation between some response
values which otherwise might have been regarded as identical.
To discover the actual effects, the coefficients of the forty items
of the American Council Completion Test were computed, first,
employing whole number differences between response values
and, second, using two place decimal units. The coefficients by
the first method were correlated with those by the second. The
coefficient of correlation was found to be .909. The mean of
the crude unit difference coefficients is .065; that of the other is
.070. The indications are, then, that while the relative positions
of the coefficients remain very nearly the same regardless of the
refinement of units, this refinement tends to increase the
coefficient.

6. The item coefficient formula calls for the multiplication of
the difference between the pairs of response means by the
product of the frequencies of the respective responses. It is
significant to note that the M values and the n values are not
unrelated. The contingency diagram of Figure 3 expresses this
association, the M values having been transmuted into deviations
from the general mean of the CSC Scores.

The vertical axis represents the difference between a particular
response value in terms of the CSC Score and the mean CSC
Score of the entire group, while the horizontal axis indicates the
frequency of the response. The data on which the contingency
diagram is built may be found in Table IV on page 17. At the
lower left hand corner of the diagram of Figure 3 is indicated,
for example, that of the response values between 0 and 105, eight
were associated with frequencies of from 1 through 21, and so on.
The line of the maximum deviation for various frequencies
indicates the values that items showing perfect differentiation would
obtain. The values along this line were determined as follows:
The CSC Scores for the entire group were listed in order of
size from highest to lowest. Then for a frequency of 20 the
mean of the 20 highest was computed, and so on. The line
represents the smoothed curve running through the means such
as these.

FIGURE 3. ASSOCIATION BETWEEN FREQUENCY AND DEVIATION VALUES OF
RESPONSES

The horizontal axis indicates the frequency of response (in classes
1-21, 22-42, 43-63, 64-84, 85-105, and 106-126); the vertical, the
deviation of the response value from the general mean. The line at the
right indicates the maximum deviation value for the various response
frequencies.
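The points through which the line of maximum deviation is drawn can be sketched as below. It is assumed here that each mean of the highest scores is expressed as a deviation from the general mean, to match the vertical axis of the diagram; the smoothing of the curve is not attempted, and the function name and scores are invented.

```python
def max_deviation(scores, frequency):
    """Deviation from the general mean of the mean of the `frequency` highest scores."""
    ranked = sorted(scores, reverse=True)      # highest to lowest
    general_mean = sum(ranked) / len(ranked)
    top_mean = sum(ranked[:frequency]) / frequency
    return top_mean - general_mean

scores = list(range(1, 11))                    # hypothetical criterion scores
line = [max_deviation(scores, k) for k in range(1, len(scores) + 1)]
print(line)
```

The values fall steadily as the frequency rises and reach zero when the response group is the entire group, which is the inverse relation the diagram displays.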

The relationship between the deviation and the frequency in
the preceding diagram is inverse and curvilinear. Where the
frequency is high, the deviation value tends to reach its
maximum, but this maximum is low. As the frequencies become
smaller, the deviation values tend to distribute themselves more
widely and the discrepancy between the actual and the maximal
values becomes larger. The diagram further illustrates how
differentiation is lessened when the same response is made by a
large proportion of the entire group. It may prove helpful also
in the interpretation of the results of the attempt to locate the
source of item coefficient unreliability, presented below.

EMPIRICAL STUDY OF THE RELIABILITY OF THE ITEM COEFFICIENT

Whatever are the defects mentioned above, they are hardly
less apparent in one form or another in the usual coefficients of
association or correlation. The item coefficient has, however, one
relatively peculiar defect. It has, at least thus far, successfully
thwarted attempts to determine, through mathematical treatment,
a measure of its reliability. An empirical treatment is therefore
necessary. The results of such a treatment on a small scale,
together with a theoretical consideration of the matter, are
presented in the following pages.

The data employed consisted of:

1. The response validity values and the response frequencies
for forty items of the American Council Completion Test as
computed with 137 cases.

2. The item validity coefficients based on the results with
these 137 cases.

3. 1 and 2 above, based on the first 68 cases of the entire group.

4. 1 and 2 above, based on the last 69 cases of the entire group.

These data are given in part in Tables IV and V.

First, the coefficient of correlation was computed for the forty
pairs of coefficients for each item; that is, the coefficient based
on the first 68 cases was correlated with that based on the last
69. The correlation coefficient proved to be only .279. The
P.E. of the coefficient is .099.
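The text does not state how the probable error was computed. The sketch below assumes the formula customary at the period, P.E. = .6745 (1 - r²) / √n, with n taken as the forty pairs of coefficients; both the formula and the choice of n are assumptions of this example.

```python
from math import sqrt

def probable_error_of_r(r, n):
    """Probable error of a correlation coefficient under the assumed formula."""
    return 0.6745 * (1 - r ** 2) / sqrt(n)

pe = probable_error_of_r(0.279, 40)
print(round(pe, 3))
```

Under these assumptions the result is about .098, close to but not identical with the figure printed, so the author may have used a slightly different variant or unrounded values.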

Second, the question arose, "Does the reliability of the item
coefficient vary with the size of the coefficient?" In order to
answer this question, the difference between the first group and
the second group coefficients for each item was correlated with
the size of the item coefficient based on the entire group. The
coefficient was found to be .524. There is, then, an apparent
tendency for low coefficients to be more reliable in absolute
terms. This is to be expected from the fact that the smaller the
item coefficient based on the entire group, the less is the possible
range of difference between the item coefficients based on the
halves of the group. When the difference between the first and
second group coefficients was correlated with the size of the
coefficient based on the first half group, and not on the entire group,
the r fell to .300. The conclusion still remains, although to a less
striking degree, that with the data employed good items are
less reliable in absolute terms than poor items. The P.E.'s of
the coefficients are respectively .078 and .097.
The third approach involved an analysis of the component
parts of the item coefficient formula and a consideration of the
reliability of the various parts. The most significant factors of
the formula are the response frequencies and the response values.
Coefficients of correlation between the first and second groups
for these factors were:

n values of the first half with the n values of the
second half                                          070 ± 005

M values of the first half with the M values of the
second half                                          [illegible]

The most significant cause of unreliability seems to be contained
in the response CSC Score values. These values are largely
dependent upon the reliability of the CSC Scores themselves.
Hence the results of this tentative study point especially to
the need of determining empirically the relationship between
the reliability of the item coefficient and the criterion score
reliability.

In the fourth place, does the reliability of the response value
vary with the size of the response group? Returning to the
method of assigning CSC Score response values (described on
pages 14 ff.), it seems probable that the response values of high
frequency responses would show a higher reliability than those
of low frequency. The results with the American Council
Completion Test were studied to determine whether the above
supposition was substantiated in fact. The difference between the
response value with the first group and that with the second
was employed as a measure of the reliability of the response
value. The response frequency is simply the number of students
out of the entire group making the given response. In Figure 4
is plotted the frequency of occurrence of the differences between
the first and the second group response values corresponding to
the various sizes of the response group. As the response
frequency increases, the mean of the differences and their
variability decrease. With increasing frequencies, then, the
reliability of the response values becomes both greater and more
constant. Consequently, it would appear reasonable to say that
the item coefficient may be made more reliable by increasing the
number of cases employed in its computation.

FIGURE 4. ASSOCIATION BETWEEN FREQUENCY AND RELIABILITY OF
RESPONSE

The frequency of response is indicated on the horizontal axis;
reliability of response along the vertical axis.
Fifth, how does the reliability of the response value vary with
the size of the deviation of the response group mean from the
mean of the entire group? The contingency diagram of Figure 5
presents the reliability as above plotted against the deviations of
the response CSC Score value from the mean CSC Score of
the entire group. Most of the cases are contained in the lower
regions of both scales, and hence the association is high at the
lower extremity. For the cases falling elsewhere, the association
is very small. High deviations, then, in themselves do not
indicate the probability of high reliability.

FIGURE 5. ASSOCIATION BETWEEN DEVIATION VALUE AND RELIABILITY OF
RESPONSE

The horizontal axis indicates the deviation of the response value from
the general mean; the vertical, the reliability of the response value.

PRACTICAL EFFECTIVENESS OF THE ITEM COEFFICIENT IN CHOOSING ITEMS


In the light of the high unreliability of the item coefficient as
computed with the available data, it is evident that some proof
that the coefficient really selects the best items must be advanced
before valid results may be found with the use of the coefficient
concerning the factors associated with item goodness. The items
of the American Council Completion Test were divided twice
into four groups of ten, first, on the basis of the item validity
coefficient as computed with the first 68 cases, and second, on
the basis of the coefficient as computed with the last 69 cases.
The meaning of the "ten best items" and the "ten worst items"
as used below is self-evident. The "first mediocre set" includes
the items ranking sixteenth through twenty-fifth in size of item
coefficient. The "second mediocre set" includes those ranking
eleventh through fifteenth, and twenty-sixth through thirtieth.
Scores according to both the author's scoring method and the
new scoring technique were computed for each set of items.
These scores were correlated with the CSC Scores to give the
validity of the set. The coefficients of correlation are presented
in Table XIV.
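The division of forty items into four sets of ten may be sketched as below. The first mediocre set is taken as ranks sixteen through twenty-five, so that each set holds ten items; the function name and the coefficients are hypothetical.

```python
def rank_sets(coefficients):
    """coefficients: dict mapping item number to item coefficient (forty items)."""
    # Rank the items from highest to lowest coefficient.
    ranked = sorted(coefficients, key=coefficients.get, reverse=True)
    return {
        "ten best": ranked[0:10],                           # ranks 1-10
        "second mediocre": ranked[10:15] + ranked[25:30],   # ranks 11-15 and 26-30
        "first mediocre": ranked[15:25],                    # ranks 16-25
        "ten worst": ranked[30:40],                         # ranks 31-40
    }

# Made-up coefficients: item k has coefficient (41 - k) / 400.
sets = rank_sets({k: (41 - k) / 400 for k in range(1, 41)})
print({name: len(items) for name, items in sets.items()})
```

The two mediocre sets straddle the middle of the ranking, so their average rank is the same and they should show similar validity if the coefficient orders items reliably.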

TABLE XIV

VALIDITY COEFFICIENTS OF SETS OF ITEMS OF THE AMERICAN COUNCIL
COMPLETION TEST ILLUSTRATING THE EFFECTIVENESS OF THE ITEM
COEFFICIENT


Items


Ten best items
Ten worst items
First mediocre set
Second mediocre set
All items


Old Scoring    New Scoring    Old Scoring    New Scoring


J38* 073 
042 i: 082 


091:1; 080 

assit 080 


J91 069 

J09d: 074 
340 ±. 072 


The coefficients of correlation are computed in each case with
the group other than that used in the determination of the item
validity coefficients and the new scoring values. In the case of
the selection based on the first 68 cases, one of the mediocre sets
proved somewhat higher than the best set in the size of the
validity coefficient of correlation based on the old scoring method
scores. In the case of the selection based on the last 69 cases,
the ten best items yielded the lowest validity correlation for both
the old and the new scoring method scores, the item coefficient
being, then, insufficiently effective.

The old scoring and the new scoring coefficients retain about
the same relative position for each set of items. This fact
eliminates the possibility of criticism of the use of the new scoring
validity coefficients in evaluating the effectiveness of the
coefficient.

The thirty items of the Roback Test 1 were divided into three
sets of ten, according to the size of the item coefficient. In
addition, the best five items were similarly selected. Scores were
computed for these sets according to the new scoring method and
were subsequently correlated with criterion scores, yielding the
following validity coefficients:

Ten best items          −.020 ± .031

Ten mediocre items      −.147 ± .079

Ten worst items         −.147 ± .079

Five best items          .027 ± .031

All items               −.148 ± .079

The scoring values and the item coefficients were based upon
the first 68 cases. The coefficients of correlation were computed
with the last 69 cases.

The item coefficient appears to have been successful in
selecting the best five and, to a large extent, the best ten of the items,
but has not adequately differentiated between the second and
third item sets.

Similarly, a set of the best ten items and a set of mediocre
items were selected out of the sixty items of the Thorndike
Academic Information Test, IIG8. The mediocre set
included those ranking twenty-sixth through thirty-fifth in size of
item coefficient. The item coefficient does not adequately
distinguish between the best and the mediocre sets, the validity
coefficients of correlation being as follows:

New Scoring Method Score with the Ten Best Items        −.138 ± .077

New Scoring Method Score with the Ten Mediocre Items     .109 ± .077

New Scoring Method Score with All the Items             −.011 ± .078

With the data employed, the item coefficient has proved only
moderately successful at best in differentiating between the
effective and the ineffective items. This may be due to several
causes, the most significant of which are listed below.

First, the items studied, having already been carefully selected
by expert psychologists, might be expected to have relatively
restricted true differences in goodness, and hence further
differentiation within the restricted range is made difficult.

Second, the factors making for the unreliability of the
coefficient, such as the unreliability of the criterion scores and the
smallness of the groups with which the coefficients were
computed, tend to destroy the effectiveness of the item coefficient.

Third, the selection of items by means of the item validity
coefficient necessarily disregards the intercorrelations among
items. A tentative suggestion purporting to overcome this
difficulty in part is made below.

Fourth, since the item coefficient was originally intended for
a test (the McCall Multi-Mental) in which the responses might
be regarded as having degrees of value rather than as being
quite entirely correct or incorrect, as is the case with most
intelligence test items, certain errors might have resulted. A
modification of the item formula which allows for this situation
is discussed below.

The first and second causes of ineffectiveness mentioned above
may be partly eliminated by changes in the original selection
of items for study and in the selection of subjects for study.
In connection with the third source of error, since it is entirely
infeasible to compute the intercorrelations among items, a less
involved substitute method is necessary. The theory underlying
the present tentative suggestion is that once items have been
grouped together as measuring the same trait, a rough
approximation of the average intercorrelations for each item might be
obtained by correlating each item with the total score on all
items. Thus a "consistency" coefficient would be computed for
each item. Paralleling the reasoning when true
intercorrelations are employed in estimating the value of an item, it would
become necessary to weight inversely the consistency and to
weight directly the validity of an item. Thus in selecting
between two items of identical validity, the one having the lower
consistency would be expected to be more effective when joined with
other items to yield a test score. The value of this suggestion
and more exact directions as to its use can be shown only after
much intensive study. As an indication of the kind of trial
and error research that seems necessary, the following is noted.

It was first necessary to compute the item consistency
coefficients for the items of the American Council Completion Test,
employing the total score on the subtest as the criterion.
Incidentally, these coefficients are more reliable than the validity
coefficients, since the reliability of the criterion scores is
considerably higher than that of the CSC Scores, as indicated by a
reliability coefficient of .797, computed in the usual manner,
i.e., by correlating half test scores and estimating the whole test
score reliability coefficient by means of the Spearman-Brown
formula. The ten best items of the American Council
Completion Test were then selected on the basis of the item validity
coefficient. The validity of this set of items when rescored by
the new scoring method is represented by a coefficient of
correlation of .519, computed with the cases on which the new scoring
values were based. The coefficient was lowered to .460 when
the items were selected so that they represented the worst ten
in consistency of the best twenty in validity. This one result is
necessarily inconclusive; the final solution of the matter is not
attempted in the present study.
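Two steps of this trial may be sketched in code: the Spearman-Brown estimate of whole test reliability from a half test correlation, and the selection of the items lowest in consistency among those highest in validity. The function names, the default sizes, and the sample coefficients are assumptions of this example; the half test correlation behind the reported .797 is not given in the text.

```python
def spearman_brown(half_r):
    """Whole-test reliability estimated from the correlation between two halves."""
    return 2 * half_r / (1 + half_r)

def worst_consistency_of_best_validity(validity, consistency, pool=20, keep=10):
    """Pick the `keep` least consistent items among the `pool` most valid ones.

    validity, consistency: dicts mapping item number to coefficient.
    """
    best_by_validity = sorted(validity, key=validity.get, reverse=True)[:pool]
    return sorted(best_by_validity, key=consistency.get)[:keep]

# Hypothetical coefficients for four items, choosing 2 of the best 3.
validity = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
consistency = {"a": 0.5, "b": 0.2, "c": 0.3, "d": 0.0}
chosen = worst_consistency_of_best_validity(validity, consistency, pool=3, keep=2)
print(chosen, round(spearman_brown(0.5), 3))
```

Item "d" is the least consistent of all but is excluded because it falls outside the pool of most valid items, which is the point of screening on validity first.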

MODIFICATION OF THE ITEM COEFFICIENT

In connection with the fourth source of error indicated above,
some evidence was found to lead to the conclusion that a
modification of the item coefficient formula would prove beneficial.
Often in the tests employed in the study it was found that the
response which the author regards as incorrect would be found
to yield a higher response value in terms of criterion scores than
that which the author credits as correct. The coefficient as
heretofore employed credits the differentiation between response
values even where their direction is contrary to subjective logic.
Whether logic should be upheld in the situation must be
determined again through empirical results. The ten best items of
the Thorndike IIG8 Test were selected, first, according to the
usual coefficient, and, second, according to a coefficient described
below, which decreases in size when the incorrect response value
is higher than the correct response according to the author. The
validity coefficients of correlation were:

Ten best items on basis of usual item coefficient       −.133 ± .077

Ten best items on basis of modified item coefficient     .097 ± .077

In this instance, a marked improvement resulted from the use
of the modified item coefficient.




The modified formula is written like the original, described on
page 38:

$$C = \frac{(M_1 - M_2)(n_1 \times n_2) + (M_1 - M_3)(n_1 \times n_3) + \text{etc.}}{S.D. \times N^2}$$

but the $M_1$, $M_2$, $M_3$, etc., are no longer necessarily in the order
of size from highest to lowest. Where an adequate judgment,
supported by whatever objective evidence is available, can be
expected to indicate that a certain response is better than a
second response, then the M value of the response said to be worse
is subtracted from the M value of the response judged to be
better, even though the worse has a higher value in criterion score
terms than does the better. Certain "(M − M)(n × n)" terms
may then become negative. The modified procedure operates on
the assumption that the reversals of the response values from
the logically expected order are errors which the item fails to
avoid, and hence ought to lower the coefficient for the item.
Where there is doubt as to the order of the response values, then
the previous procedure is followed, namely, that which places
first the M value which was actually found to be highest in terms
of the criterion scores, and so on. The "(M − M)(n × n)"
term for such cases is consequently always positive.
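The modified procedure may be sketched as follows. The representation of the judgment as a rank (1 for the response judged best, `None` where the order is in doubt), the function name, and the sample figures are all assumptions made for illustration.

```python
from itertools import combinations

def modified_item_coefficient(responses, sd):
    """responses: list of (mean, frequency, judged_rank) tuples.

    judged_rank 1 is the response judged best; None means no judgment is made.
    """
    total = sum(freq for _, freq, _ in responses)
    numerator = 0.0
    for (m_a, n_a, r_a), (m_b, n_b, r_b) in combinations(responses, 2):
        if r_a is None or r_b is None or r_a == r_b:
            diff = abs(m_a - m_b)   # order in doubt: empirical order, always positive
        elif r_a < r_b:
            diff = m_a - m_b        # a is judged better; negative if b scores higher
        else:
            diff = m_b - m_a        # b is judged better
        numerator += diff * n_a * n_b
    return numerator / (sd * total ** 2)

# Hypothetical item whose "correct" response has the lower criterion mean.
judged = modified_item_coefficient([(50.0, 30, 1), (55.0, 20, 2)], sd=10.0)
doubtful = modified_item_coefficient([(50.0, 30, None), (55.0, 20, None)], sd=10.0)
print(judged, doubtful)
```

With a judgment the reversal counts against the item; without one the same data are credited in full, as under the original formula.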

DETERMINATION OF THE OBJECTIVE FACTORS ASSOCIATED WITH
ITEM GOODNESS

Because of the low reliability and effectiveness of the item
coefficient as employed with the present data, the study of the
relation of certain objective measures to the measure of item
validity can retain but small significance. Such objective item
analysis must await the proof of the effectiveness of the measure
of item goodness. The treatment of this phase of the study is
consequently brief and of a tentative nature.

The item validity and consistency coefficients for several of the
subtests are presented in Table XV. Item numbers are listed
vertically at the extreme left and, in the case of the last twenty
items of the Thorndike IIG8 Test, at the right.

Figure 6 represents graphically the data of Table XV
concerning the validity coefficients of the various items. The figure
illustrates the significantly wide variability of the measured
validity goodness of the items contained within a given subtest.



TABLE XV

ITEM VALIDITY COEFFICIENTS FOR SEVERAL SUBTESTS, WITH THE HALF GROUP
ITEM VALIDITY AND THE WHOLE GROUP ITEM CONSISTENCY
COEFFICIENTS FOR THE AMERICAN COUNCIL COMPLETION TEST
















It also indicates the difference in the average goodness and the
variability of the goodness of the item coefficients taken as a
group for each subtest. The standard deviations and the means
of the item coefficients for each of the four subtests represented
are as follows:

Test                  S.D.      Mean    No. of Items   No. of Cases
                                                       Employed

American Council 1    26 595    6625    40             137
Brown 3               23 €95    0218    20             100
Roback 1              29115     0674    30             68
Thorndike IIG8        38 875    0573    50             100



FIGURE 6. DISTRIBUTION OF THE ITEM VALIDITY COEFFICIENTS FOR SEVERAL
SUBTESTS





measures of item validity and consistency. The detailed account
of the determination of these coefficients is given on pages 38 and
49. The measure of difficulty employed was the percentage that
the number making the correct response according to the author's
scoring was of the total number making an attempt with the
item. This method of calculating item difficulty eliminates
somewhat the effect of the position of the item by omitting the group
which apparently had insufficient time to reach the later items.
The contingency diagrams of Figure 7 indicate the
interrelationships. The validity measures are practically uncorrelated
with the consistency or the difficulty measures. The consistency
and the difficulty measures appear to have a slight curvilinear
relationship; the middle ranges of difficulty are associated in
general with low consistency. This last phenomenon may be
due to the fact that the item coefficient tends to penalize extreme
difficulties, as indicated on page 40. That the same relationship
was not found in the case of validity with difficulty may be due
to the greater unreliability of the validity coefficients (pages 10
and 50).
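The measure of item difficulty described above can be sketched briefly. The coding of responses as "correct", "wrong", or `None` for an item not attempted is an assumption of this example, as is the function name.

```python
def item_difficulty(responses):
    """Percentage correct among those who attempted the item.

    responses: one entry per student, 'correct', 'wrong', or None if not attempted.
    """
    attempted = [r for r in responses if r is not None]
    if not attempted:
        raise ValueError("no student attempted the item")
    correct = sum(r == "correct" for r in attempted)
    return 100.0 * correct / len(attempted)

# Four hypothetical students, one of whom did not reach the item.
print(item_difficulty(["correct", "wrong", None, "correct"]))
```

Leaving the unattempted cases out of the denominator is what keeps late items from appearing harder merely because fewer students reached them.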



CHAPTER V 

THE ANALYSIS OF THE SUBTESTS

The treatment of subtests as if they were true elements of a
psychological examination is defective, theoretically, essentially
because the component items are dissimilar in their power of
differentiation and also, to a less certain extent, in their true
effective or "psychological" content. However, the analysis of
subtests treated as units has several values.

First, it is an essential step in item analysis in that it aids in
the adequate selection of items for intensive analysis and in that
it gives the predictive value of the subtest, utilizing the author's
scoring key, against which may be compared the predictive
values of the subtest according to other methods of scoring. This
use is illustrated and developed in the discussion in Chapters
III and IV.

Second, it is valuable in itself as indicating roughly the degree
to which factors such as reliability, difficulty, form, and the like,
are associated with the validity of various types of test items.
This second use is limited, and for two reasons. First, the items
within any subtest vary so much as to make the crude summary
of those items which is represented by the subtest score simply
suggestive of the probable true relationships, and nothing more.
Second, the subtests, to a far greater extent than the items, are
unequated for irrelevant factors. The results of the study of
the subtests, which is the subject of this chapter, are partially
invalidated by the limitations indicated above.

DETERMINATION OF THE MEASURES EMPLOYED

The measure of validity of a subtest was determined by computing
the Pearson coefficient of correlation between the scores
on that test and the CSC Scores. The method of determining
the criterion scores is explained in Chapter II, pages 7 ff. In
the case of the Brown and Roback subtests, the raw scores were
transmuted into T scores. This transmutation does not affect the
relative positions of the scores and hence modifies the computed
coefficients of correlation only to a negligible extent. All but
five of the thirty-eight subtests which were administered yielded
scores sufficiently variable for the computation of the validity
measure.
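The validity measure described above is an ordinary product-moment correlation. The following is a minimal sketch in Python, offered only as an illustration: the function name `pearson_r` and the sample scores are hypothetical and are not data from the study. It also shows that a linear rescaling of the raw scores leaves the coefficient unchanged; the T-score transmutation may not be strictly linear, which would be consistent with the text's claim of a negligible rather than a zero effect.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical subtest scores and criterion scores for six students.
subtest = [12, 15, 9, 20, 17, 11]
criterion = [70, 74, 65, 82, 71, 69]

r_raw = pearson_r(subtest, criterion)
# A linear rescaling (50 + 2 * score) keeps every student's relative position.
rescaled = [50 + 2 * s for s in subtest]
r_rescaled = pearson_r(rescaled, criterion)
print(round(r_raw, 3), round(r_rescaled, 3))
```

The two printed coefficients agree, since a positive linear change of scale cannot alter a product-moment correlation.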

The reliability measure was determined by, first, dividing the
subtest into two equal or practically equal groups of items;
second, summing the scores on each half test; and, third,
correlating the score on one half of the test with that on the other.
The usual assumption was made that the halves of the test are
similarly matched. The resultant coefficient gives, then, the
reliability or consistency of the half test. Twelve of the subtests did
not yield coefficients, because they were either inadequate in
differentiation or serial in nature.
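The three steps of the reliability measure can be sketched as follows. This is an illustration under stated assumptions, not the author's worksheet: the text says only that the halves were equal or practically equal, so the alternating (odd versus even) split used here is an assumption, and the item scores are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def split_half_reliability(item_scores):
    """item_scores holds one list of per-item scores for each student.

    Step 1: split the items by alternating position (an assumed split).
    Step 2: sum each student's scores on each half.
    Step 3: correlate the two sets of half-test scores.
    """
    first_half = [sum(row[0::2]) for row in item_scores]
    second_half = [sum(row[1::2]) for row in item_scores]
    return pearson_r(first_half, second_half)

# Hypothetical right (1) / wrong (0) scores: five students, six items.
scores = [
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
]
print(round(split_half_reliability(scores), 3))
```

As in the text, the coefficient obtained this way describes the half test; no step-up to full test length is applied here.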

The testing time is used as an approximate measure of the
number of minutes spent by the student in responding to the
item stimulations. Slight inaccuracies in this measure, such as
that caused by the fact that the faster student may have spent
only part of the time in actual work or that due to the flexible
time limits characteristic of the Brown tests, were of necessity
neglected.

Difficulty was determined by dividing the mean score on a
test by the possible maximum score on that test.

The speed of item response gives for each test the time spent
per item, taken on the average.
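The difficulty and speed measures are simple ratios. A brief sketch, assuming hypothetical function names and sample figures (the argument `n_items` is an assumption, since the text gives only the resulting time per item):

```python
def difficulty(scores, max_score):
    """Mean score on a test divided by the possible maximum score."""
    return (sum(scores) / len(scores)) / max_score

def time_per_item(testing_minutes, n_items):
    """Average number of minutes spent per item."""
    return testing_minutes / n_items

# Hypothetical example: three students on a 10-point, 10-item, 3-minute test.
print(difficulty([6, 8, 10], 10))   # mean of 8 out of a possible 10
print(time_per_item(3, 10))
```

Under this definition the measure is the share of the maximum score attained on the average.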

The form of the test items refers simply to the number of
responses that may be attempted. Thus Type A includes two
or three choice test items; Type B, four through eight choice
items; Type C, practically unlimited choice items; and Type D,
unclassifiable items.

The so-called content types are as follows:

1. Language comprehension
2. Language manipulation
3. Language comprehension and use: completion
4. Language analogies
5. Language opposites
6. Judgment and reasoning
7. Information
8. Numerical or algebraic manipulation
9. Special material



TABLE XVI

DATA EMPLOYED IN THE ANALYSIS OF THE SUBTESTS

TABLE XVI (Continued)

Subtest                  Val. Coeff.          Rel. Coeff.     Time   Difficulty   Time per   Form   Content
                         and P.E.             and P.E.                            Item

Part I
3 Missing Parts          .056 ± .051                          3      .62          .30        D
4 Picture Analogies      .189 ± .049                          3      .65          .37        B      [illegible]
5 Geom. Figure Anal.     .143 ± .050                          3      .86          .30        C      9
6 Algebra                .213 ± .049                          10     .75          1.67       C      8
7 Mech. Information      .027 ± .051                          4      .13          .40        C      7
8 Gen. Acad. Infor.      .099 ± .050          .132 ± .050     13     .[?]9        .22        A      7

Parts II and III
1 Reading                [illegible] ± .050   .371 ± .044     36     .43                     C      1
2 Language Completion    .148 ± .050          .536 ± .030     23     .[?]6        1.17       C      3

RESULTS

Table XVI presents for the subtests studied all the available
measures of validity, reliability, difficulty, time per item, form,
and content. Time is given in terms of minutes.

The lowness of the validity coefficients as compared with those
usually reported may be due in part to the unreliability of the
criterion scores, in part to the fact that an unusually large number
of students engage in outside work, and in part to the restricted
variability of the group.

In studying the relationship between validity, reliability, and
time, taken in pairs, rank method coefficients of correlation were
computed utilizing the twenty-five complete sets of measures of
these functions. The r values (transmuted from rho) are:

Validity with Reliability           .403 ± .113
Validity with Testing Time         -.147 ± .132
Reliability with Testing Time      -.284 ± .134
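The text does not state which formulas were used for the transmutation from rho to r or for the probable errors. The sketch below assumes the conventional ones of the period, namely Pearson's conversion r = 2 sin(pi * rho / 6) and P.E. = 0.6745 (1 - r^2) / sqrt(N). Both are assumptions about the author's procedure, and the function names are hypothetical. With N = 25, the assumed probable-error formula reproduces the first two printed values.

```python
import math

def rho_to_r(rho):
    """Assumed conversion of a rank-difference coefficient (rho) to r."""
    return 2.0 * math.sin(math.pi * rho / 6.0)

def probable_error(r, n):
    """Assumed conventional probable error of a correlation coefficient."""
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)

# Validity with reliability and validity with testing time, 25 subtests.
print(round(probable_error(0.403, 25), 3))   # 0.113
print(round(probable_error(-0.147, 25), 3))  # 0.132
```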


Although factors other than time are probably the causes of
the inverse variation, the indications are that the time devoted
to a test is not a very significant factor in causing high validity
or high reliability. In general, the more reliable test is the more
valid test, but it is difficult to say whether that is due to the
transference of reliability to validity or to the fact that a given
test author has selected tests according to a standard that is
equally high, relatively, for both.
When the effect of testing time is held constant, the coefficient
of correlation of validity with reliability is found, by the use
of partial correlations, to be .381.
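The partial coefficient can be checked from the three coefficients given above, using the standard first-order partial correlation formula. The sketch below is illustrative only; the function name `partial_r` is hypothetical.

```python
import math

def partial_r(r12, r13, r23):
    """First-order partial correlation of variables 1 and 2, holding 3 constant."""
    return (r12 - r13 * r23) / math.sqrt((1.0 - r13 ** 2) * (1.0 - r23 ** 2))

# 1 = validity, 2 = reliability, 3 = testing time (values from the text).
r = partial_r(0.403, -0.147, -0.284)
print(round(r, 3))  # 0.381
```

Because both correlations with testing time are small, holding time constant lowers the coefficient only slightly, from .403 to .381.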
Difficulty and time per item were each correlated with validity,
yielding the following product-moment coefficients:

Validity with Difficulty         -.104            N = 33
Validity with Time per Item       .001 ± .141     N = 23

In general, the less difficult the test the more valid it proved.
The mean validity coefficients for the subtests grouped according
to type of test form are as follows:

Type    Mean Validity Coefficient    No. of Tests
A       [illegible]                  3
B       .133                         6
C       .135                         18
D       .15[?]                       6

The multiplicity of choice as an index of test form shows no
significant association with the validity of the subtests.
The association between test content as analyzed and test
validity is more striking, as indicated by the following means
of the subtest validity coefficients for the various content types:

Type    Mean Validity Coefficient    Rank    No. of Tests
1       .176                         2       3
2       .107                         8       4
3       .[?]10                       1       3
4       .135                         6       3
5       .145                         3.5     3
6       .130                         7       6
7       .072                         9       3
8       .145                         3.5     4
9       .141                         5       4

These results are conditioned by errors of classification, by
the unreliability of the item coefficients, by the operations of
"irrelevant" factors such as test author, length of test, and so
on, and by chance influences. It is interesting to note, however,
that certain usual findings, such as the superiority of the language
completion type and the inferiority of the information
type, are substantiated. (See page 57 for key to types.)



CHAPTER VI 

SUMMARY AND CONCLUSIONS

1. The purpose of the present study is to ascertain whether
college entrance intelligence tests may be improved by the use,
in connection with the selection and scoring of test items, of
certain relatively objective statistical devices.

2. The relatively objective new scoring method employed
bases the value of a response to an item stimulation upon a
measure representative of the College Success Criterion Scores
earned by those students who have made a given response. The
conclusions numbered 3 through 9 are based on the use of the
mean as this representative measure.

3. Test constructors usually assign a single scoring value to
a certain type of response, such as the subjectively determined
"correct" response, whereas, when evaluated in terms of the
College Success Criterion Scores, any given response type shows,
for the various items, a wide distribution of values.

4. Within any one item, various types of response, such as
an incorrect attempt and an omission, are very often assigned
the identical scoring value, whereas the more objective measure
here employed usually indicates different values for the different
types of response.

5. In general, to omit a response to an item stimulation is an
indication of higher College Success Criterion Score than it is
to make an incorrect attempt.

6. In the majority of the tests the few students who were
outstandingly slow in their test reactions proved, in general, to be
markedly superior in College Success Criterion Score.

7. The possibility of correcting the test scoring key by means
of objective item response analysis is illustrated in the case of
responses which the author's scoring key regarded as wrong, but
which the scorer thought deserved credit, and some of which
were made by students with high College Success Criterion
Scores, on the average.



8. The empirical comparison of the old and the new scoring
method indicates results which with two tests point slightly
favorably to the new technique, with two others, slightly against
it, and with two others markedly against it.

9. The indications are that the new scoring method failed to
produce any significant improvement, essentially because of the
lowness of the original correlation between the tests and the
criterion scores on which the new scoring values were based and
because of the dissimilarity of the group employed in determining
the new scoring values with the group with which the values
were tried in the rescoring of the tests. The new technique
must for practical purposes prove its worth with groups as
dissimilar as those here employed, but it ought to receive a trial
with tests showing higher validity coefficients than those used
here. This implies experimentation in a field where criterion
scores that are highly reliable can be found.

10. A tentative investigation to determine whether some value
other than the mean ought not to be employed as representing
the College Success Criterion Scores associated with a given item
response revealed, for a single set of items, the following order
of merit of the various measures, from best to poorest: the
median, the upper quartile, the mean, and the proportion of a
superior group making a given response. Each of these measures,
for the same set of items, proved inferior to the old scoring
method of the author.

11. There is evidence to show that an approach toward the
uniform weighting of the items of a set, achieved by limiting
the distance between the lowest and the highest item response
value to a given number of units, is an improvement over the
weighting of items in proportion to their validity differentiating
power.

12. The analysis of the characteristics of the item coefficient
invented by McCall and modified by others indicates that the
measure is theoretically sound where chance errors are minimized.
Its one significant defect is the fact that an associated
reliability measure cannot, apparently, be devised algebraically.

13. The reliability of the coefficient as computed empirically
by correlating item coefficients determined with the same items
but with two different groups is represented by a coefficient of
correlation as low as [illegible].





14. The higher item coefficients prove no more reliable than
the lower ones, when the reliability of the coefficient is measured
in terms of the absolute difference between the first group and
the second group item coefficients.

15. There is evidence to show that the unreliability of the
coefficient is due more to the unreliability of the response values
than to the unreliability of the response frequencies. Since these
values are largely dependent upon the College Success Criterion
Scores, the original source of the item coefficient unreliability
may be traced in the last analysis to the unreliability of the
criterion scores.

16. While high response frequencies are in general associated
with high reliability of response value, the size of the response
value is unassociated with response value reliability.

17. The item coefficient is only moderately successful at best
in selecting sets of items that are the ten best, the ten poorest,
and so on.

18. A single tentative trial of the use, in selecting the best
items, of a "consistency" item coefficient along with the validity
item coefficient, failed to improve the selection.

19. The writer's modification of the item coefficient, as
described, and as employed with a single test, resulted in a marked
improvement in the selection of the ten best items.

20. The item validity goodness, as measured by the item
coefficient, shows a significantly wide variability for the items
contained in any given subtest.

21. The variability and the mean of the measures of item
goodness vary for the several subtests studied.

22. The measure of item validity goodness is practically
uncorrelated with either that of item consistency or that of item
difficulty.

23. The consistency and difficulty measures show a slight
curvilinear relationship, the middle ranges of difficulty being
associated with the highest consistency, while the extremities in
difficulty are associated, in general, with low consistency.

24. The coefficient of correlation (rho transmuted into r)
between the validity and the reliability coefficients for twenty-five
subtests proved to be .403. When the time of the tests is
held constant (by means of partial r's) the coefficient of
correlation of validity with reliability is slightly lowered.
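Conclusions 2 through 4 contrast a conventional key, which credits only the "correct" response, with a key that attaches a separate value to every type of response. A minimal sketch of rescoring a paper under both keys follows. All response values, item names, and the function name `score_paper` are hypothetical; none are taken from the study.

```python
def score_paper(responses, key):
    """Sum the value that the key attaches to each response a student made."""
    return sum(key[item][resp] for item, resp in responses.items())

# Hypothetical response values, standing in for mean criterion-score
# deviations. "wrong" and "omit" receive different values here.
response_values = {
    "item1": {"right": 2.0, "wrong": -3.0, "omit": -1.0},
    "item2": {"right": 1.5, "wrong": -2.5, "omit": 0.5},
}
# A conventional key: one credit for "right", zero for anything else.
author_key = {
    "item1": {"right": 1, "wrong": 0, "omit": 0},
    "item2": {"right": 1, "wrong": 0, "omit": 0},
}

paper = {"item1": "right", "item2": "omit"}
new_score = score_paper(paper, response_values)  # 2.0 + 0.5
old_score = score_paper(paper, author_key)       # 1 + 0
print(new_score, old_score)
```

The same function serves both keys, so the two scoring methods differ only in the table of values supplied.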



APPENDIX I

SUGGESTIONS FOR DECREASING THE LABOR ASSOCIATED
WITH ITEM ANALYSIS

The use of certain forms and procedures may decrease considerably the
time and labor connected with the analysis of items. The following are
suggestions which have grown out of the writer's experience with this type
of work.

1. In the selection of symbols for the various types of responses, the
test author's credited responses should be indicated by corresponding
numerals, where possible; responses assigned a value of zero should be
represented by other symbols. This will facilitate the determination of the
author's scores.

2. Where the scoring is not too complicated, the tabulation of the response
symbols should take place simultaneously with the scoring of the tests.
With the use of the tabulation form illustrated on page 12, this will entail
no great hardship.

3. The computations involved in the determination of the new response
values, when the mean of the criterion scores of the response group is
used, may be made most economically as follows:

(1) Transmute the criterion scores of the entire group into plus and
minus deviations from an assumed mean of the scores.

(2) Compute the sum of the plus and the sum of the minus deviations
of these scores.

(3) Add the sums algebraically.

(4) On a narrow strip of paper, place next to the number representing
each student, his criterion score deviation, in one color, if plus or zero, in
another, if minus. The student numbers and criterion score deviations
should be placed on the slip so as to correspond with the tabulation of the
response symbols illustrated on page 18.

(5) To tabulate the criterion scores associated with each response, the
criterion score deviation slip should then be placed immediately adjoining
the tabulation of the response symbols for the item to be studied.

(6) The most frequent response type should be determined through
inspection and omitted from the tabulation of associated criterion scores.

(7) Add algebraically the criterion score deviations for each of the
responses tabulated.

(8) Divide the results of (7) by the respective response frequencies to
obtain the response value indicated as the deviation, in terms of criterion
score units, of each of the response group means from the assumed mean
of the entire group. The value of the most frequent response will not as
yet have been determined.

(9) Join into one algebraic sum the results of (7), computed for each
of the tabulated responses.

(10) Change the sign of the total sum of (9).

(11) Add algebraically the result of (3) with that of (10).

(12) Divide the result of (11) by the frequency of the omitted
response group to give the deviation of that response group mean from the
assumed mean of the entire group.

(13) Divide the result of (3) by the frequency of the entire group.
This will give the unit correction indicating the difference between the
true and the assumed means.

(14) Subtract from each response group result indicated in (8) and
(12) the unit correction of (13). This will yield the values of the
various responses expressed as deviations, in terms of College Success
Criterion Score units, of the response group means from the true mean of the
entire group.

4. In rescoring responses and in computing the item coefficient, since
distances between the response values are desired, it is not necessary to
compute steps (13) and (14) of the above analysis.

5. In rescoring the responses, it was found best to repeat the tabulations
illustrated on page 12, merely substituting the new values for the old.

6. In rescoring responses, it is helpful to consider the most frequent
response as zero, and to determine the others according to the differences
of each from the most frequent response value.

7. The modification of the scoring method is helpful in two cases: first,
where a very extreme value is determined with a very small group, and
second, where a student fails to attempt a good number of the later items,
presumably because of lack of time. To avoid the errors and difficulties
inherent in these two situations, extreme values earned by three or less
students were reduced according to a set scale, and where a student failed
to attempt a number of items at the end of the test, the lowest value of
the responses of each item was assigned in the case of each item omitted.
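The hand procedure in steps (1) through (14) of suggestion 3 can be sketched in Python as follows. This is an illustration, not the author's worksheet: the function name `response_values`, the sample criterion scores, and the response symbols are hypothetical. The point of the procedure is that the most frequent response is never tabulated directly; its value is recovered by subtraction from the total.

```python
from collections import Counter, defaultdict

def response_values(criterion, responses, assumed_mean):
    """Deviation of each response group's mean criterion score from the true mean."""
    deviations = [c - assumed_mean for c in criterion]        # step (1)
    total_dev = sum(deviations)                               # steps (2)-(3)
    freq = Counter(responses)
    most_frequent = freq.most_common(1)[0][0]                 # step (6)

    sums = defaultdict(float)
    for dev, resp in zip(deviations, responses):              # step (7)
        if resp != most_frequent:
            sums[resp] += dev
    values = {resp: sums[resp] / freq[resp] for resp in sums}  # step (8)

    tabulated_total = sum(sums.values())                      # step (9)
    remainder = total_dev - tabulated_total                   # steps (10)-(11)
    values[most_frequent] = remainder / freq[most_frequent]   # step (12)

    correction = total_dev / len(criterion)                   # step (13)
    return {resp: v - correction for resp, v in values.items()}  # step (14)

# Hypothetical criterion scores and item responses for six students.
criterion = [70, 80, 90, 60, 75, 85]
responses = ["a", "a", "b", "c", "a", "b"]
values = response_values(criterion, responses, assumed_mean=75)
print({k: round(v, 2) for k, v in values.items()})
```

The result agrees with the direct computation (each group mean minus the mean of the whole group), whatever assumed mean is chosen. As suggestion 4 notes, steps (13) and (14) may be skipped when only the distances between response values are wanted.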



APPENDIX II

MISCELLANEOUS SUPPLEMENTARY RESULTS

1. Total scores for each of the examinations were obtained by summing
the scores on the several subtests contained in each, no attempt being made
to weight the various subtest scores. The total examination scores were
correlated with the College Success Criterion Scores to yield the following
Pearson coefficients:

Examination          Val. Coef.    P.E.    No. of Cases    Group
Brown University     .265          .062    100             B
Roback               .111          .057    137             C
Thorndike            .235          .048    175             A
Thurstone IV         .176          .070    61              D

The D Group was more variable than the other groups, and hence a small
downward correction of its validity coefficient is necessary. The lowness
of the coefficients as compared with those found elsewhere is due in part
to the factors indicated on page 59.

2. The validity coefficient of the Thorndike Examination rises from .235
to .290 when the score of the QOS test, that is, the true-false academic
information test, is omitted from the composite.
3. The Reading Test items of the Thorndike Examination fall readily
into four divisions. The reliability of the quarter tests was determined
by computing the Pearson coefficient of correlation between each possible
pair of quarter test scores, and obtaining the mean coefficient associated
with each quarter test. The results follow:

CORRELATION BETWEEN QUARTER TESTS

            IIa       IIb            IIIa      IIIb
IIa                   .3[?]3         .317      .271
IIb         .3[?]3                   .178      .179
IIIa        .317      .178                     .076
IIIb        .271      .179           .076
Mean        .[??]7    [illegible]    .190      .175

Portions of material apparently highly similar show significant
differences in reliability. The results illustrate the need for the careful use and
interpretation of the reliability coefficient.

4. The fluctuation of the results of the comparison of various scoring
methods when few cases are employed was brought to light by the
computation of validity coefficients first with 35 cases, then with 40, and finally
with the combined 75. The results follow:

Scoring and Selection                             First 35    Second 40    Combined 75
Old Scoring, all items                             .132        .121         .200
New Scoring, all items                            -.13[?]      .14[?]      -.011
New Scoring, best 10 in validity coefficient      -.10[?]      .014        -.138
New Scoring, best 10 in modified val. coef.        .035        .205         .007

The test employed was the Thorndike nos. The new scoring values
were determined on a separate group of 100 students. It is apparent that
the measured value of the several instances of scoring and selection
methods varies considerably for the two smaller groups.
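The quarter-test reliability described in item 3 is the mean of the correlations between one quarter and each of the other three. A brief sketch, using the three coefficients given in item 3 for the third quarter test (.317, .178, and .076); the labels Q1 through Q4 and the function name are hypothetical:

```python
def mean_correlation_with_others(pair_correlations, label):
    """Average the correlations of one quarter test with each other quarter."""
    related = [r for pair, r in pair_correlations.items() if label in pair]
    return sum(related) / len(related)

# Correlations of the third quarter test with the other three quarters.
pair_correlations = {
    ("Q1", "Q3"): 0.317,
    ("Q2", "Q3"): 0.178,
    ("Q3", "Q4"): 0.076,
}
print(round(mean_correlation_with_others(pair_correlations, "Q3"), 3))  # 0.19
```

The result, .190, matches the mean reported for that quarter test.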






